Reload nameserver information on lookup failure #41582

jonhoo · 2017-04-27T17:03:44Z

As discussed in #41570, UNIX systems often cache the contents of /etc/resolv.conf, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see #41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes #41570.
Depends on rust-lang/libc#585.

r? @alexcrichton
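
For illustration, the behavior described above can be sketched as follows. This is a minimal sketch, not the patch itself: lookup_then_reload and both closure parameters are hypothetical names, with the reload closure standing in for the res_init call. It assumes the error is still returned to the caller after the reload, so that only later lookups see the refreshed nameserver entries.

```rust
use std::io;

/// Runs `lookup`; if it fails, runs `reload` before handing the error back.
/// Both closures are hypothetical stand-ins: `lookup` for the host lookup,
/// `reload` for the call that re-reads the nameserver configuration.
fn lookup_then_reload<T, L, R>(mut lookup: L, mut reload: R) -> io::Result<T>
where
    L: FnMut() -> io::Result<T>,
    R: FnMut(),
{
    match lookup() {
        Ok(value) => Ok(value),
        Err(err) => {
            // The failure may be due to stale resolver state, so refresh it.
            // The current error is still reported; no retry happens here.
            reload();
            Err(err)
        }
    }
}
```

With this shape, a successful lookup never pays for the reload; only the failure path does.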

alexcrichton · 2017-04-27T18:17:42Z

I wonder, are there performance implications to this? E.g., does a bunch of failing queries now take much longer?

jonhoo · 2017-04-27T18:47:51Z

@alexcrichton yes, I believe that that is true. This is why I first proposed in #41570 that we might instead want a connect_uncached, or some other way of converting from str to SocketAddr. That said, with this approach, applications for which this is a problem can manually call getaddrinfo, and then use the resulting SocketAddr. Such a workaround does not exist if you do want this behavior, because you can't force the cache to be ignored except by calling res_init.

retep998 · 2017-04-27T19:00:55Z

src/libstd/sys_common/net.rs

+                    // The lookup failure could be caused by using a stale /etc/resolv.conf.
+                    // See https://github.com/rust-lang/rust/issues/41570.
+                    // We therefore force a reload of the nameserver information.
+                    c::res_init();


This code is in sys_common so it is compiled on all platforms, but I don't see a res_init on Windows.

That's what the if cfg!(unix) above was intended for?

Ohh, of course, it's still compiled. I need #[cfg(..)] {} instead, right?
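
The distinction raised in this thread can be sketched as follows. cfg!(unix) expands to a boolean constant, so both branches of an if are still compiled and every symbol they name must exist on all targets; #[cfg(unix)] removes the annotated item before compilation. The function names below are hypothetical stand-ins, not the names used in libstd.

```rust
// Hypothetical stand-in for the Unix-only hook (the real patch calls res_init).
// This item does not exist at all when compiling for non-Unix targets.
#[cfg(unix)]
fn reload_resolver() -> &'static str {
    "reloaded"
}

// Fallback for targets such as Windows, where there is nothing to reload.
#[cfg(not(unix))]
fn reload_resolver() -> &'static str {
    "no-op"
}

fn on_lookup_failure() -> &'static str {
    // Writing `if cfg!(unix) { unix_only_fn() }` here would still require
    // `unix_only_fn` to resolve on every platform. Selecting the function
    // with #[cfg] attributes avoids that.
    reload_resolver()
}
```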

arielb1 · 2017-05-02T09:37:56Z

@alexcrichton - are you looking at this PR? Friendly ping to keep this on your radar.

alexcrichton · 2017-05-02T14:26:41Z

@arielb1 ah yes I am, mostly just caught up in travel!

@jonhoo would you mind benchmarking the effect of calling ::res_init on failed queries? Just to get an idea of how much slower it is.

jonhoo · 2017-05-02T17:33:40Z

@alexcrichton I'd say pretty negligible:

#![feature(test)]
extern crate libc;
extern crate test;

fn main() {}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::net::ToSocketAddrs;

    #[bench]
    fn bench_plain(b: &mut Bencher) {
        let addr = ("google.com", 80);
        b.iter(|| addr.to_socket_addrs().map(|a| a.count()).unwrap_or(0));
    }

    #[bench]
    fn bench_reinit(b: &mut Bencher) {
        let addr = ("google.com", 80);
        b.iter(|| {
            addr.to_socket_addrs().map(|a| a.count()).unwrap_or(0);
            unsafe { libc::res_init() };
        });
    }
}

$ cargo +nightly bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/res_init_bench-94665f8bc2ebf0bc

running 3 tests
test tests::bench_plain  ... bench:   4,419,043 ns/iter (+/- 331,978)
test tests::bench_reinit ... bench:   4,420,291 ns/iter (+/- 297,022)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

$ sudo netctl stop-all
$ cargo +nightly bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/res_init_bench-94665f8bc2ebf0bc

running 3 tests
test tests::bench_plain  ... bench:     267,152 ns/iter (+/- 18,776)
test tests::bench_reinit ... bench:     267,891 ns/iter (+/- 9,855)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

alexcrichton · 2017-05-03T14:39:22Z

@jonhoo awesome thanks for testing!

jonhoo · 2017-05-04T01:24:33Z

@alexcrichton I believe that with rust-lang/libc#585 merged, 1309652 should now work fine. What's the process for building with a more recent libc?

jonhoo · 2017-05-04T04:49:13Z

@alexcrichton not sure if this is the right way to do it, but I bumped the submodule revision for liblibc, and everything seems to be working as it should. r?

alexcrichton · 2017-05-04T06:10:36Z

@bors: r+

Looks good! Let's see what CI says

bors · 2017-05-04T06:10:36Z

📌 Commit db36fc8 has been approved by alexcrichton

bors · 2017-05-04T06:10:46Z

⌛ Testing commit db36fc8 with merge 31b2a0a...

bors · 2017-05-04T06:24:07Z

💔 Test failed - status-travis

TimNN · 2017-05-04T06:25:51Z

Looks legitimate (Edit: this is on macOS, e.g. https://travis-ci.org/rust-lang/rust/jobs/228616549):

[00:02:40]   = note: Undefined symbols for architecture x86_64:
[00:02:40]             "_res_9_init", referenced from:
[00:02:40]                 std::net::lookup_host::hd3d9eaad793f8abc in std-438eba4cd7d88a45.0.o
[00:02:40]           ld: symbol(s) not found for architecture x86_64
[00:02:40]           clang: error: linker command failed with exit code 1 (use -v to see invocation)

kennytm · 2017-05-04T07:59:47Z

The problem is on the libc side: using res_init on macOS also requires linking with -lresolv.

jonhoo · 2017-05-04T15:33:02Z

@kennytm yeah, that's what I was worried about. @alexcrichton, looks like we can't rely on std to pull in resolv. How do you want us to fix this? Add linking with resolv on macOS in libc?

alexcrichton · 2017-05-04T16:28:00Z

Ah yeah, sorry. To be clear, I think adding linkage directives to libstd is OK for now; I think we just want to avoid modifying libc for now due to the impact it'll have.

jonhoo · 2017-05-04T17:09:01Z

@alexcrichton how would I go about doing that?

alexcrichton · 2017-05-04T18:09:47Z

I think the best way would likely be to modify libc-shim's build script in this repo to print out rustc-link-lib=resolv on OSX perhaps?

jonhoo · 2017-05-04T18:21:50Z

Ah, okay -- gave that a shot in 912da9b.

alexcrichton · 2017-05-04T19:16:22Z

Looks pretty good to me, although on second thought I think this may actually be best in src/libstd/build.rs, as the function is actually used in libstd. Sorry about that! While you're at it, could you also squash the commits into one?
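
As a rough sketch of the build-script approach discussed here: a Cargo build script requests a native library by printing a cargo:rustc-link-lib line. The helper below is hypothetical (it is not the code from this PR) and assumes the target triple is passed in as a string; in a real build script, main would read the TARGET environment variable and print each returned line.

```rust
/// Returns the Cargo directives needed for the given target triple.
/// Hypothetical helper: only Apple targets get the extra libresolv link.
fn link_directives(target: &str) -> Vec<String> {
    let mut directives = Vec::new();
    if target.contains("apple-darwin") || target.contains("apple-ios") {
        // res_init lives in libresolv on these platforms.
        directives.push("cargo:rustc-link-lib=resolv".to_string());
    }
    directives
}
```

Keeping the directive next to the crate that actually calls the function means crates that never use res_init do not pick up the extra linkage.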

jonhoo · 2017-05-04T19:27:10Z

Done in 43acba5fa62ac0018aa8dd498d5709f07a68bd43

alexcrichton · 2017-05-04T21:25:22Z

@bors: r+

Thanks!

Mark-Simulacrum · 2017-05-05T01:16:32Z

macOS failed again with the same error, I think, though not sure:

[01:28:35]   "_res_9_init", referenced from:
[01:28:35]       std::net::lookup_host::hd3d9eaad793f8abc in libbar.a(libbar.0.o)

jonhoo · 2017-05-05T03:26:41Z

That's so strange... The log even says that it's linking with libresolv. At this point I think I'd need someone running macOS to take a look and track this symbol down? Debugging this is really tricky without direct access to a macOS environment. The only thing that comes to mind is that the symbol may actually be called res_9_init (note the lack of the _ prefix) in recent macOS deployments? If so, I guess we'd need to update libc too @alexcrichton ? Could someone running macOS take a stab at confirming this locally?

alexcrichton · 2017-05-05T03:43:41Z

Oh -lresolv is correct but both of these tests are performing manual linking, so the libs linked just need to be updated (they're run-make tests I believe)

As discussed in rust-lang#41570, UNIX systems often cache the contents of /etc/resolv.conf, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see rust-lang#41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd. Introduces an std linkage dependency on libresolv on macOS/iOS (which also makes it necessary to update run-make/tools.mk). Fixes rust-lang#41570. Depends on rust-lang/libc#585.

jonhoo · 2017-05-05T04:02:50Z

Updated src/test/run-make/tools.mk, squashed, and pushed in 68ae617.

alexcrichton · 2017-05-05T04:18:42Z

@bors: r+

bors · 2017-05-05T04:18:43Z

📌 Commit 68ae617 has been approved by alexcrichton

jonhoo · 2017-05-05T04:22:25Z

I'll take that as you agreeing that 68ae617#diff-bff523d3aff3a86f367f9d199a559b71R75 is the right change to make :p

alexcrichton · 2017-05-05T04:25:55Z

Indeed!

jonhoo · 2017-05-05T14:45:34Z

Hmm, seems like @bors is taking a while to pick this up -- @alexcrichton ?

Mark-Simulacrum · 2017-05-05T14:50:10Z

You are currently 4th in the queue: https://buildbot2.rust-lang.org/homu/queue/rust. It can take a while, especially now: we're actively trying to land a patch that fixes nightly, and that has higher priority, so whenever we retry it, it jumps to the top.

jonhoo · 2017-05-05T14:53:05Z

Ah, thanks! Didn't know there was a way of looking at the queue. Never mind me :)

Rollup of 9 pull requests - Successful merges: #41064, #41307, #41512, #41582, #41678, #41722, #41734, #41761, #41763 - Failed merges:

bors · 2017-05-06T01:46:47Z

⌛ Testing commit 68ae617 with merge 42a4f37...

retep998 · 2017-05-06T01:48:52Z

Is bors drunk? This PR was just merged in a rollup, so why is bors trying to test it?

jonhoo · 2017-05-11T20:02:38Z

It is worth noting that if https://sourceware.org/bugzilla/show_bug.cgi?id=984 ever lands, we might want to revert this change.

tamird · 2017-05-23T01:54:08Z

src/libstd/sys_common/net.rs

+                // The lookup failure could be caused by using a stale /etc/resolv.conf.
+                // See https://github.com/rust-lang/rust/issues/41570.
+                // We therefore force a reload of the nameserver information.
+                c::res_init();


Doesn't this still result in surprising behaviour if e.g. the contents of /etc/resolv.conf change without the old resolver becoming unusable?

For instance, if I change my DNS resolver without making the old resolver unreachable, I'll never hit this error and any running rust applications will continue to use the old resolver...indefinitely.

Yes. Though if the resolution happens successfully, what is the problem? It's also quite hard to get around that particular case. We could always call res_init, but that seems a little wasteful. The real solution to this is to fix libc (most libcs do not have this problem; glibc is the major exception). Applications that want to be robust against this could always call libc::res_init directly though of course.

> Though if the resolution happens successfully, what is the problem?

Playing devil's advocate, "successful" doesn't imply "correct".

> We could always call res_init, but that seems a little wasteful.

How wasteful? Perhaps this is worth measuring.

> The real solution to this is to fix libc (most libcs do not have this problem; glibc is the major exception).

What do you mean? What would "fixing" libc look like? What do other libcs do in contrast to glibc?

>> Though if the resolution happens successfully, what is the problem?

> Playing devil's advocate, "successful" doesn't imply "correct".

True, though that sounds like a very weird setup indeed. One in which you can connect using the resolution information from the old server, but you need to instead connect to the server provided by a new resolver?

>> We could always call res_init, but that seems a little wasteful.

> How wasteful? Perhaps this is worth measuring.

I did some benchmarks above (#41582 (comment)), and it's not terrible (especially because it doesn't require a syscall), but if we can avoid doing something...

>> The real solution to this is to fix libc (most libcs do not have this problem; glibc is the major exception).

> What do you mean? What would "fixing" libc look like? What do other libcs do in contrast to glibc?

No other libcs have this issue. Some of them don't cache /etc/resolv.conf, some integrate with NSS or similar services, which know when the cache should be flushed. I haven't looked into it too carefully. It is unclear what the "right" solution is given that glibc wants to be both fast (i.e., don't do a file read on every connect), and not rely on other services (like NSS).

oconnor663 · 2017-07-14T16:03:04Z

The glibc man page for res_init notes:

The traditional resolver interfaces such as res_init() and res_query() use some static (global) state stored in the _res structure, rendering these functions non-thread-safe.

Is it safe to call res_init like we're doing now? If not, what are the options for making it safer? We could take a global lock, but I don't think that would help if e.g. we're linking against code in other languages that doesn't know about our lock. The same man page notes that there are more recent functions like res_ninit that can use per-thread state. Those might be safer if they're widespread enough to depend on them.
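
To make the "global lock" option concrete, here is a minimal sketch. RESOLVER_LOCK and with_resolver_lock are hypothetical names, and the closure stands in for the res_init call. As noted above, a lock like this only serializes callers that go through it; it does nothing for code in other languages that touches the resolver state directly.

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical process-wide lock guarding resolver reinitialization.
static RESOLVER_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` while holding the lock, so that at most one caller using this
/// helper is inside the critical section at any time.
fn with_resolver_lock<T>(f: impl FnOnce() -> T) -> T {
    // A poisoned lock still provides mutual exclusion, so recover the guard
    // instead of panicking.
    let _guard: MutexGuard<'_, ()> = RESOLVER_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    f()
}
```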

jonhoo · 2017-07-14T16:50:00Z

That's a good point, though through some digging it appears as though res_init is in fact thread-safe. It's unfortunately somewhat tricky to use res_ninit and have it affect the same structure as is used by glibc. There's some discussion of it further down in the glibc man page:

In glibc, when you link with -lpthread, such a per-thread resolver state is already present. It can be accessed using _res, which has been redefined as a macro, in a similar way to what has been done for the errno and h_errno variables. This per-thread resolver state is also used for the gethostby* family of functions, which means that for example gethostbyname_r is now fully thread-safe and re-entrant.

This suggests that we could (and probably should) use res_ninit, but we'd need to figure out which per-thread symbol to use. The aliasing magic seems to reside here, but I can't quite tell how it works to provide a per-thread _res symbol through a macro?

oconnor663 · 2017-07-17T14:26:56Z

@jonhoo should we file an issue somewhere for this? I'm not sure I would call it "unsound" so much as "maybe unsound under certain very specific conditions and versions of glibc" :p

jonhoo · 2017-07-17T14:55:39Z

Well, https://sourceware.org/bugzilla/show_bug.cgi?id=984 is now marked as RESOLVED, so it could be that we can now revert this commit entirely (though I haven't tested). I don't know of a version of glibc where our use is unsound, though you may be right that there is one.

oconnor663 · 2017-07-17T15:56:33Z

Existing versions of glibc that don't contain the fix will be around for years though. Probably we can never revert this workaround? :(

oconnor663 · 2017-08-01T17:09:20Z

@jonhoo I'm close to convinced that we've actually started doing something unsafe here. See #43592.

Edit: A fix has landed: #44965

rust-highfive assigned alexcrichton Apr 27, 2017

retep998 reviewed Apr 27, 2017

alexcrichton added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 27, 2017

jonhoo mentioned this pull request Apr 30, 2017

Add res_init rust-lang/libc#585

Merged

arielb1 added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 2, 2017

jonhoo force-pushed the reread-nameservers-on-lookup-fail branch from 65fcd0d to 1309652 Compare May 4, 2017 01:13

jonhoo force-pushed the reread-nameservers-on-lookup-fail branch from 912da9b to 43acba5 Compare May 4, 2017 19:26

jonhoo force-pushed the reread-nameservers-on-lookup-fail branch from 43acba5 to 68ae617 Compare May 5, 2017 04:02

frewsxcv mentioned this pull request May 5, 2017

Rollup of 9 pull requests #41773

Merged

bors added a commit that referenced this pull request May 5, 2017

Auto merge of #41773 - frewsxcv:rollup, r=frewsxcv

42a4f37

bors merged commit 68ae617 into rust-lang:master May 6, 2017

kennytm added a commit to kennytm/rust-ios-android that referenced this pull request May 21, 2017

iOS: Fix fallout from rust-lang/rust#41582.

0fefdab

tamird reviewed May 23, 2017

jonhoo mentioned this pull request May 23, 2017

No way to refresh DNS information leading to indefinite network failures #41570

Closed

oconnor663 mentioned this pull request Aug 1, 2017

calling libc::res_init from multiple threads is unsafe on at least OSX #43592

Closed
