tweak f64 deserialization to make float roundtrips work #671
Conversation
WDYT about testing every f32? :-]

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Instant;

use rayon::prelude::*;

let start = Instant::now();
static N: AtomicU32 = AtomicU32::new(0);
(0..=u32::MAX).into_par_iter().for_each(|i| {
    let input = f32::from_bits(i) as f64;
    if input.is_finite() {
        let json = serde_json::to_string(&input).unwrap();
        let output: f64 = serde_json::from_str(&json).unwrap();
        if input != output {
            let x = N.fetch_add(1, Ordering::Relaxed);
            println!("errors: {} with {:#010x}", x + 1, i);
        }
    }
});
let end = Instant::now();
println!("errors: {}", N.load(Ordering::Relaxed));
println!("DONE in {:?}", end - start);
```

```
DONE in 380.674042494s
```

That works fine but takes 6-7 minutes, so it's probably way too long to run on CI (which probably has less parallelism). If I deserialize/serialize to `f32` instead:

```diff
- let input = f32::from_bits(i) as f64;
+ let input = f32::from_bits(i);
  if input.is_finite() {
      let json = serde_json::to_string(&input).unwrap();
-     let output: f64 = serde_json::from_str(&json).unwrap();
+     let output: f32 = serde_json::from_str(&json).unwrap();
```

```
errors: 1 with 0x95ae43fd
errors: 2 with 0x15ae43fd
```

Will look into the errors.
OK, I see the issue. When serializing, we use the f32 variant of ryu to serialize f32 values and the f64 variant to deal with f64 inputs. When deserializing, everything goes through the 64-bit variant of minimal-lexical and the output then gets cast to an f32 if the user requested that; doing it like that results in an error of 1 ULP compared to using the 32-bit variant of minimal-lexical (on the ryu-stringified value of `0x15ae43fd`, i.e. `7.038531e-26`).

EDIT: added some code to visualize this better.

```rust
let x = 0x15ae43fd;
let y = f32::from_bits(x);
println!("u32: {:#010x}", x);
println!("std: {}", y);
println!("ryu: {}", ryu::Buffer::new().format_finite(y));

let a = minimal_lexical::parse_float::<f32, _, _>(b"7".iter(), b"038531".iter(), -26);
let b = minimal_lexical::parse_float::<f64, _, _>(b"7".iter(), b"038531".iter(), -26) as f32;
println!("ml32: {} ({:#010x})", a, a.to_bits());
println!("ml64: {} ({:#010x})", b, b.to_bits());
```

```
u32: 0x15ae43fd
std: 0.00000000000000000000000007038531
ryu: 7.038531e-26
ml32: 0.00000000000000000000000007038531 (0x15ae43fd)
ml64: 0.000000000000000000000000070385313 (0x15ae43fe)
```
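The same 1-ULP double-rounding can be reproduced with the standard library alone, since std's parser (like minimal-lexical) is correctly rounded. This is a standalone sketch, not code from the PR; the expected bit patterns come from the output above.

```rust
// Parse the ryu-style string both ways and compare the bit patterns.
// Parsing as f64 first and then casting to f32 double-rounds, landing
// one ULP away from the direct f32 parse.
fn parse_both_ways(s: &str) -> (u32, u32) {
    let direct = s.parse::<f32>().unwrap().to_bits();
    let via_f64 = (s.parse::<f64>().unwrap() as f32).to_bits();
    (direct, via_f64)
}

fn main() {
    let (direct, via_f64) = parse_both_ways("7.038531e-26");
    println!("direct f32 parse: {:#010x}", direct);
    println!("f64 parse + cast: {:#010x}", via_f64);
    assert_ne!(direct, via_f64); // off by 1 ULP
}
```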
> … instead of using the 64-bit version and then casting down to f32
Done, f32 roundtrip is now working for all finite f32 values (and the program seems to terminate a bit faster?). This change will bring in the …
@japaric The …
Ah, yes. You are right. Deserialization also allocates because … Using …

EDIT: for comparison, the rayon + to_string version also takes ~5 minutes.
@japaric Could you post your updated "zero-alloc in the hot sections" version?
```rust
use std::thread;
use std::time::Instant;

const NTHREADS: u32 = 8;
const STEP: u32 = u32::MAX / NTHREADS;

let mut threads = vec![];
for i in 0..(NTHREADS - 1) {
    threads.push(thread::spawn(move || {
        let mut json = vec![];
        let start = Instant::now();
        for j in i * STEP..(i + 1) * STEP {
            let input = f32::from_bits(j);
            if input.is_finite() {
                json.clear();
                serde_json::to_writer(&mut json, &input).unwrap();
                let output: f32 = serde_json::from_slice(&json).unwrap();
                assert_eq!(input, output);
            }
        }
        println!("({}) DONE in {:?}", i, start.elapsed());
    }));
}

let start = Instant::now();
let mut json = vec![];
for j in (NTHREADS - 1) * STEP..=u32::MAX {
    let input = f32::from_bits(j);
    if input.is_finite() {
        json.clear();
        serde_json::to_writer(&mut json, &input).unwrap();
        let output: f32 = serde_json::from_slice(&json).unwrap();
        assert_eq!(input, output);
    }
}
println!("({}) DONE in {:?}", NTHREADS - 1, start.elapsed());

threads.into_iter().for_each(|t| t.join().unwrap());
```

That's what I tried. It doesn't do load balancing, but all threads complete in about the same amount of time (270-300s), so maybe load balancing is not too important.
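As a quick sanity check on the manual chunking (a standalone sketch, not part of the PR), the chunk sizes can be summed to confirm that every f32 bit pattern is visited exactly once:

```rust
const NTHREADS: u32 = 8;
const STEP: u32 = u32::MAX / NTHREADS;

// Threads 0..NTHREADS-1 take half-open chunks [i*STEP, (i+1)*STEP);
// the main thread takes the inclusive tail [(NTHREADS-1)*STEP, u32::MAX].
fn patterns_covered() -> u64 {
    let workers: u64 = (0..NTHREADS - 1)
        .map(|i| ((i + 1) * STEP - i * STEP) as u64)
        .sum();
    let tail = (u32::MAX - (NTHREADS - 1) * STEP) as u64 + 1;
    workers + tail
}

fn main() {
    // All 2^32 bit patterns are covered, with no overlap between chunks.
    assert_eq!(patterns_covered(), 1u64 << 32);
    println!("all {} patterns covered", patterns_covered());
}
```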
@dtolnay did you have a chance to look at this PR? 👀
Thanks for the PR!
I wasn't immediately able to reproduce your runtime perf numbers from the PR description. I mainly care about canada.json for now since that's the one that contains floats. For that file you got pretty much identical perf before and after this PR. On my laptop I get:
Dataset | DOM - master | DOM - PR | STRUCT - master | STRUCT - PR |
---|---|---|---|---|
canada | 280 MB/s | 130 MB/s | 490 MB/s | 170 MB/s |
which is a pretty big slowdown, and reproduces on 3 machines I tried. Here is the build command I am using:

```console
CARGO_INCREMENTAL=0 \
RUSTFLAGS='-C codegen-units=1 -C target-cpu=native' \
cargo build \
    --release \
    --no-default-features \
    --features parse-dom,parse-struct,lib-serde,all-files
```
Can you double check that your benchmark result is actually using the changes from the PR?
Less importantly, I noticed this extremely suspicious code in minimal-lexical which blindly writes past the end of a `Vec`. If there is some precondition on the capacity of the `Vec` there, then `insert_many` would need to be made `unsafe` to call; otherwise it seems like `insert_many` should take care of raising the capacity of the `Vec` when needed.
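For illustration, a capacity-safe version of such a helper can be written entirely in safe Rust. This is a hypothetical sketch, not minimal-lexical's actual code: `Vec::splice` shifts the tail and grows the buffer as needed, so there is no unchecked capacity precondition.

```rust
// Insert `items` at `index`, growing the Vec when needed. Safe
// alternative to an unchecked write-past-the-end implementation.
fn insert_many<T: Copy>(vec: &mut Vec<T>, index: usize, items: &[T]) {
    vec.splice(index..index, items.iter().copied());
}

fn main() {
    let mut limbs: Vec<u64> = vec![1, 2, 5, 6];
    insert_many(&mut limbs, 2, &[3, 4]);
    assert_eq!(limbs, [1, 2, 3, 4, 5, 6]);
    println!("{:?}", limbs);
}
```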
On this comment:

> there's also the option of enabling the `arrayvec` feature or writing a custom stack allocated `Vec` that doesn't bring any extra dependencies -- though the heap-allocated `Vec` doesn't seem to degrade performance as measured by json-benchmark

It's possible the reason no difference was observed is that the `no_alloc` Cargo feature in minimal-lexical looks like it currently doesn't do anything, since this code is looking at `cfg(no_alloc)` rather than `cfg(feature = "no_alloc")`. We should try it with `RUSTFLAGS='--cfg no_alloc'` + `--features minimal-lexical/no_alloc`.
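The distinction between the two cfg predicates can be shown in a few lines (a standalone sketch; under a plain `cargo run`, with no `RUSTFLAGS` and no feature declared, neither predicate is active, so the heap path is selected):

```rust
// `cfg!(no_alloc)` only becomes true via `RUSTFLAGS='--cfg no_alloc'`;
// `cfg!(feature = "no_alloc")` only via a Cargo feature of that name.
// They are distinct switches and do not activate each other.
fn buffer_kind() -> &'static str {
    if cfg!(no_alloc) || cfg!(feature = "no_alloc") {
        "stack-allocated buffer"
    } else {
        "heap-allocated Vec"
    }
}

fn main() {
    println!("using: {}", buffer_kind());
}
```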
```diff
@@ -16,6 +16,7 @@ edition = "2018"
 serde = { version = "1.0.100", default-features = false }
 indexmap = { version = "1.2", optional = true }
 itoa = { version = "0.4.3", default-features = false }
+minimal-lexical = { git = "https://github.com/Alexhuszagh/minimal-lexical" }
```
I think this needs to use `default-features = false` and then have our std feature in this crate enable `"minimal-lexical/std"`. Otherwise there would end up being a std dependency through minimal-lexical even when serde_json/std is not enabled. I added a CI build in d6080d9 that would have caught this.

```
   Compiling minimal-lexical v0.1.0 (https://github.com/Alexhuszagh/minimal-lexical#fa1e491b)
error[E0463]: can't find crate for `std`
```
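A sketch of the manifest shape being described (an assumption based on the comment, not the exact serde_json manifest):

```toml
[dependencies]
minimal-lexical = { git = "https://github.com/Alexhuszagh/minimal-lexical", default-features = false }

[features]
# serde_json's own std feature forwards to minimal-lexical's std
# feature, so no_std builds don't pull in std transitively.
std = ["minimal-lexical/std"]
```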
Courtesy of libFuzzer, here is an input that triggers the out of bounds write:

```rust
fn main() {
    let j = "144440.000312571602869014401444400000312571602869014031257160286901444400000312144440000031257160286901444444440000312571602869014444000003125716028690144440000031257160286901444400000312571602869014444000031250312571602312571602860003125716023125716028694400003125716028690144440000031254400000312571602869014444000003125716028690144440000312503125716023125716028600031257160231257160286";
    let _ = serde_json::from_str::<f64>(j);
}
```

```
==41341== Invalid write of size 8
==41341== at 0x4841C24: memmove (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==41341== by 0x123B6C: core::intrinsics::copy (intrinsics.rs:2127)
==41341== by 0x11E02F: minimal_lexical::math::insert_many (math.rs:278)
==41341== by 0x12EEBF: minimal_lexical::math::small::ishl_limbs (math.rs:664)
==41341== by 0x12F154: minimal_lexical::math::small::ishl (math.rs:678)
==41341== by 0x12CA89: minimal_lexical::math::Math::ishl (math.rs:1046)
==41341== by 0x12CC2D: minimal_lexical::math::Math::imul_pow2 (math.rs:1025)
==41341== by 0x127626: minimal_lexical::bhcomp::small_atof (bhcomp.rs:176)
==41341== by 0x128778: minimal_lexical::bhcomp::bhcomp (bhcomp.rs:225)
==41341== by 0x121DC0: minimal_lexical::algorithm::fallback_path (algorithm.rs:176)
==41341== by 0x11BF6B: minimal_lexical::parse::parse_float (parse.rs:80)
==41341== by 0x11657A: serde_json::de::Deserializer<R>::f64_from_parts (de.rs:752)
==41341== Address 0x4ab8430 is 0 bytes after a block of size 160 alloc'd
==41341== at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==41341== by 0x13790B: alloc::alloc::alloc (alloc.rs:80)
==41341== by 0x137E23: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
==41341== by 0x133E50: alloc::raw_vec::finish_grow (raw_vec.rs:514)
==41341== by 0x1348DA: alloc::raw_vec::RawVec<T,A>::grow_amortized (raw_vec.rs:440)
==41341== by 0x134346: alloc::raw_vec::RawVec<T,A>::try_reserve (raw_vec.rs:281)
==41341== by 0x134B22: alloc::raw_vec::RawVec<T,A>::reserve (raw_vec.rs:267)
==41341== by 0x130998: alloc::vec::Vec<T>::reserve (vec.rs:505)
==41341== by 0x1387B3: minimal_lexical::math::reserve (math.rs:336)
==41341== by 0x1323E0: <minimal_lexical::bignum::Bigint as core::default::Default>::default (bignum.rs:16)
==41341== by 0x12CB30: minimal_lexical::math::Math::from_u64 (math.rs:991)
==41341== by 0x12752D: minimal_lexical::bhcomp::small_atof (bhcomp.rs:145)
```
I am going ahead and merging this since it's pretty clear that an accurate float parser is worth having. I think the remaining work will be looking into whether we can recover some of the performance and then making a call as to whether this should be behind a feature and whether that feature should be on or off by default.
Thanks all!
@dtolnay Thanks! Should we create a new issue to dig into the …
Thanks for catching this. I just fuzzed minimal-lexical and this appeared as well, which has been patched in lexical upstream (by replacing the …).

@rw This should be patched extensively in minimal-lexical, since the only unsafe usage should have been that algorithm. Still, a great reason to fuzz libraries and give a second eye, especially when porting code. This would have been a very serious issue, so thank you.
This is a new attempt at fixing issue #536 using the latest version of @Alexhuszagh's `minimal-lexical` library for deserialization of f64 values. The goal is to make float roundtrip work without severely impacting compile times.

I have:

- … `quickcheck` … but didn't add that here as it would bring in a new dev-dependency)
- … `perf` (see table below)

Note that:

- the `minimal-lexical` crate itself does not have any dependencies
- `minimal-lexical` will use a heap-allocated `Vec` for its internal, intermediate buffer instead of a stack allocated one (which would bring in the `arrayvec` dependency)
- there's also the option of enabling the `arrayvec` feature or writing a custom stack allocated `Vec` that doesn't bring any extra dependencies -- though the heap-allocated `Vec` doesn't seem to degrade performance as measured by json-benchmark
doesn't seem to degrade performance as measured by json-benchmarkcompile time changes
(compiler used: 1.43.1)
I have measured
perf stat cargo $command
on theserde-json
crate after acargo clean
for these values of$command
:check
,build
,build --release
and a faster version ofbuild --release
. The table reports the changes ininstructions:u
both in absolute value and as a percentage. The table also includes the initial and the change in wall time, to get a more "human" feel for the change. I ran perf a few times for each command; the values reported here are median values.check
build
b --release
b --release
The faster `build --release` consists of adding the profile override shown below to compile all build scripts and proc macros without optimizations.

(off-topic: Your mileage may vary though with this profile trick. I know that un-optimized `bindgen` runs very slowly, so this actually backfires when you have that in your dependency graph (or at least that was the case many months ago; I haven't used `bindgen` with embedded code in a long time))

runtime perf
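(Note: the profile override snippet referenced in the compile-time section was elided from this page; a plausible reconstruction using Cargo's `build-override` table, which applies to build scripts and proc macros, is shown here as an assumption rather than the PR's exact snippet.)

```toml
# Compile build scripts and proc macros without optimizations
# to shave time off release builds.
[profile.release.build-override]
opt-level = 0
```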
I ran `json-benchmark` with this branch and master and got these `parse` results. Again, all values are median values out of a few runs.

These values fluctuated quite a bit, so I wouldn't say this PR improves or degrades performance.
@dtolnay do these changes in compile time, in exchange for the round-trip correctness, seem like an acceptable trade-off to you?