tweak f64 deserialization to make float roundtrips work #671
Conversation
WDYT about testing every f32? :-]

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Instant;

use rayon::prelude::*;

let start = Instant::now();
static N: AtomicU32 = AtomicU32::new(0);
(0..=u32::MAX).into_par_iter().for_each(|i| {
    let input = f32::from_bits(i) as f64;
    if input.is_finite() {
        let json = serde_json::to_string(&input).unwrap();
        let output: f64 = serde_json::from_str(&json).unwrap();
        if input != output {
            let x = N.fetch_add(1, Ordering::Relaxed);
            println!("errors: {} with {:#010x}", x + 1, i);
        }
    }
});
let end = Instant::now();
println!("errors: {}", N.load(Ordering::Relaxed));
println!("DONE in {:?}", end - start);
```

```
DONE in 380.674042494s
```

That works fine but takes 6-7 minutes, so it's probably way too long to run on CI (which probably has less parallelism). If I deserialize/serialize to `f32` instead:

```diff
- let input = f32::from_bits(i) as f64;
+ let input = f32::from_bits(i);
  if input.is_finite() {
      let json = serde_json::to_string(&input).unwrap();
-     let output: f64 = serde_json::from_str(&json).unwrap();
+     let output: f32 = serde_json::from_str(&json).unwrap();
```

```
errors: 1 with 0x95ae43fd
errors: 2 with 0x15ae43fd
```

Will look into the errors.
OK, I see the issue. When serializing, we use the f32 variant of ryu to serialize f32 values and the f64 variant to deal with f64 inputs. When deserializing, everything goes through the 64-bit variant of minimal-lexical and the output then gets cast to an f32 if the user requested that; doing it like that results in an error of 1 ULP compared to using the 32-bit variant of minimal-lexical (on the ryu-stringified value of `0x15ae43fd`, i.e. `7.038531e-26`).

EDIT: added some code to visualize this better.

```rust
let x = 0x15ae43fd;
let y = f32::from_bits(x);
println!("u32: {:#010x}", x);
println!("std: {}", y);
println!("ryu: {}", ryu::Buffer::new().format_finite(y));

let a = minimal_lexical::parse_float::<f32, _, _>(b"7".iter(), b"038531".iter(), -26);
let b = minimal_lexical::parse_float::<f64, _, _>(b"7".iter(), b"038531".iter(), -26) as f32;
println!("ml32: {} ({:#010x})", a, a.to_bits());
println!("ml64: {} ({:#010x})", b, b.to_bits());
```

```
u32: 0x15ae43fd
std: 0.00000000000000000000000007038531
ryu: 7.038531e-26
ml32: 0.00000000000000000000000007038531 (0x15ae43fd)
ml64: 0.000000000000000000000000070385313 (0x15ae43fe)
```
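The same 1-ULP double-rounding can be reproduced with the standard library alone, since std's parser (like minimal-lexical) is correctly rounded. This is a standalone sketch, not code from the PR; the expected bit patterns come from the output above.

```rust
// Parse the ryu-style string both ways and compare the bit patterns.
// Parsing as f64 first and then casting to f32 double-rounds, landing
// one ULP away from the direct f32 parse.
fn parse_both_ways(s: &str) -> (u32, u32) {
    let direct = s.parse::<f32>().unwrap().to_bits();
    let via_f64 = (s.parse::<f64>().unwrap() as f32).to_bits();
    (direct, via_f64)
}

fn main() {
    let (direct, via_f64) = parse_both_ways("7.038531e-26");
    println!("direct f32 parse: {:#010x}", direct);
    println!("f64 parse + cast: {:#010x}", via_f64);
    assert_ne!(direct, via_f64); // off by 1 ULP
}
```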
> … instead of using the 64-bit version and then casting down to f32
Done, f32 roundtrip is now working for all finite f32 values (and the program seems to terminate a bit faster?). This change will bring in the …
@japaric The …
Ah, yes. You are right. Deserialization also allocates because … Using …

EDIT: for comparison, the rayon + to_string version also takes ~5 minutes.
@japaric Could you post your updated "zero-alloc in the hot sections" version?
```rust
use std::thread;
use std::time::Instant;

const NTHREADS: u32 = 8;
const STEP: u32 = u32::MAX / NTHREADS;

let mut threads = vec![];
for i in 0..(NTHREADS - 1) {
    threads.push(thread::spawn(move || {
        let mut json = vec![];
        let start = Instant::now();
        for j in i * STEP..(i + 1) * STEP {
            let input = f32::from_bits(j);
            if input.is_finite() {
                json.clear();
                serde_json::to_writer(&mut json, &input).unwrap();
                let output: f32 = serde_json::from_slice(&json).unwrap();
                assert_eq!(input, output);
            }
        }
        println!("({}) DONE in {:?}", i, start.elapsed());
    }));
}

let start = Instant::now();
let mut json = vec![];
for j in (NTHREADS - 1) * STEP..=u32::MAX {
    let input = f32::from_bits(j);
    if input.is_finite() {
        json.clear();
        serde_json::to_writer(&mut json, &input).unwrap();
        let output: f32 = serde_json::from_slice(&json).unwrap();
        assert_eq!(input, output);
    }
}
println!("({}) DONE in {:?}", NTHREADS - 1, start.elapsed());

threads.into_iter().for_each(|t| t.join().unwrap());
```

That's what I tried. It doesn't do load balancing, but all threads complete in about the same amount of time (270-300s), so maybe load balancing is not too important.
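As a quick sanity check on the manual chunking (a standalone sketch, not part of the PR), the chunk sizes can be summed to confirm that every f32 bit pattern is visited exactly once:

```rust
const NTHREADS: u32 = 8;
const STEP: u32 = u32::MAX / NTHREADS;

// Threads 0..NTHREADS-1 take half-open chunks [i*STEP, (i+1)*STEP);
// the main thread takes the inclusive tail [(NTHREADS-1)*STEP, u32::MAX].
fn patterns_covered() -> u64 {
    let workers: u64 = (0..NTHREADS - 1)
        .map(|i| ((i + 1) * STEP - i * STEP) as u64)
        .sum();
    let tail = (u32::MAX - (NTHREADS - 1) * STEP) as u64 + 1;
    workers + tail
}

fn main() {
    // All 2^32 bit patterns are covered, with no overlap between chunks.
    assert_eq!(patterns_covered(), 1u64 << 32);
    println!("all {} patterns covered", patterns_covered());
}
```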
@dtolnay did you have a chance to look at this PR? 👀
Thanks for the PR!
I wasn't immediately able to reproduce your runtime perf numbers from the PR description. I mainly care about canada.json for now since that's the one that contains floats. For that file you got pretty much identical perf before and after this PR. On my laptop I get:
Dataset | DOM - master | DOM - PR | STRUCT - master | STRUCT - PR |
---|---|---|---|---|
canada | 280 MB/s | 130 MB/s | 490 MB/s | 170 MB/s |
which is a pretty big slowdown, and reproduces on 3 machines I tried. Here is the build command I am using:

```console
CARGO_INCREMENTAL=0 \
RUSTFLAGS='-C codegen-units=1 -C target-cpu=native' \
cargo build \
    --release \
    --no-default-features \
    --features parse-dom,parse-struct,lib-serde,all-files
```
Can you double check that your benchmark result is actually using the changes from the PR?
Less importantly, I noticed this extremely suspicious code in minimal-lexical which blindly writes past the end of a `Vec`. If there is some precondition on the capacity of the `Vec` there, then `insert_many` would need to be made `unsafe` to call; otherwise it seems like `insert_many` should take care of raising the capacity of the `Vec` when needed.
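For illustration, a capacity-safe version of such a helper can be written entirely in safe Rust. This is a hypothetical sketch, not minimal-lexical's actual code: `Vec::splice` shifts the tail and grows the buffer as needed, so there is no unchecked capacity precondition.

```rust
// Insert `items` at `index`, growing the Vec when needed. Safe
// alternative to an unchecked write-past-the-end implementation.
fn insert_many<T: Copy>(vec: &mut Vec<T>, index: usize, items: &[T]) {
    vec.splice(index..index, items.iter().copied());
}

fn main() {
    let mut limbs: Vec<u64> = vec![1, 2, 5, 6];
    insert_many(&mut limbs, 2, &[3, 4]);
    assert_eq!(limbs, [1, 2, 3, 4, 5, 6]);
    println!("{:?}", limbs);
}
```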
On this comment:

> there's also the option of enabling the `arrayvec` feature or writing a custom stack allocated `Vec` that doesn't bring any extra dependencies -- though the heap-allocated `Vec` doesn't seem to degrade performance as measured by json-benchmark

It's possible the reason no difference was observed is that the `no_alloc` Cargo feature in minimal-lexical looks like it currently doesn't do anything, since this code is looking at `cfg(no_alloc)` rather than `cfg(feature = "no_alloc")`. We should try it with `RUSTFLAGS='--cfg no_alloc'` + `--features minimal-lexical/no_alloc`.
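The distinction between the two cfg predicates can be shown in a few lines (a standalone sketch; under a plain `cargo run`, with no `RUSTFLAGS` and no feature declared, neither predicate is active, so the heap path is selected):

```rust
// `cfg!(no_alloc)` only becomes true via `RUSTFLAGS='--cfg no_alloc'`;
// `cfg!(feature = "no_alloc")` only via a Cargo feature of that name.
// They are distinct switches and do not activate each other.
fn buffer_kind() -> &'static str {
    if cfg!(no_alloc) || cfg!(feature = "no_alloc") {
        "stack-allocated buffer"
    } else {
        "heap-allocated Vec"
    }
}

fn main() {
    println!("using: {}", buffer_kind());
}
```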
```diff
@@ -16,6 +16,7 @@ edition = "2018"
 serde = { version = "1.0.100", default-features = false }
 indexmap = { version = "1.2", optional = true }
 itoa = { version = "0.4.3", default-features = false }
+minimal-lexical = { git = "https://github.com/Alexhuszagh/minimal-lexical" }
```
I think this needs to use `default-features = false` and then have our std feature in this crate enable `"minimal-lexical/std"`. Otherwise there would end up being a std dependency through minimal-lexical even when serde_json/std is not enabled. I added a CI build in d6080d9 that would have caught this.

```
   Compiling minimal-lexical v0.1.0 (https://github.com/Alexhuszagh/minimal-lexical#fa1e491b)
error[E0463]: can't find crate for `std`
```
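A sketch of the manifest shape being described (an assumption based on the comment, not the exact serde_json manifest):

```toml
[dependencies]
minimal-lexical = { git = "https://github.com/Alexhuszagh/minimal-lexical", default-features = false }

[features]
# serde_json's own std feature forwards to minimal-lexical's std
# feature, so no_std builds don't pull in std transitively.
std = ["minimal-lexical/std"]
```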
Courtesy of libFuzzer, here is an input that triggers the out of bounds write:

```rust
fn main() {
    let j = "144440.000312571602869014401444400000312571602869014031257160286901444400000312144440000031257160286901444444440000312571602869014444000003125716028690144440000031257160286901444400000312571602869014444000031250312571602312571602860003125716023125716028694400003125716028690144440000031254400000312571602869014444000003125716028690144440000312503125716023125716028600031257160231257160286";
    let _ = serde_json::from_str::<f64>(j);
}
```

```
==41341== Invalid write of size 8
==41341== at 0x4841C24: memmove (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==41341== by 0x123B6C: core::intrinsics::copy (intrinsics.rs:2127)
==41341== by 0x11E02F: minimal_lexical::math::insert_many (math.rs:278)
==41341== by 0x12EEBF: minimal_lexical::math::small::ishl_limbs (math.rs:664)
==41341== by 0x12F154: minimal_lexical::math::small::ishl (math.rs:678)
==41341== by 0x12CA89: minimal_lexical::math::Math::ishl (math.rs:1046)
==41341== by 0x12CC2D: minimal_lexical::math::Math::imul_pow2 (math.rs:1025)
==41341== by 0x127626: minimal_lexical::bhcomp::small_atof (bhcomp.rs:176)
==41341== by 0x128778: minimal_lexical::bhcomp::bhcomp (bhcomp.rs:225)
==41341== by 0x121DC0: minimal_lexical::algorithm::fallback_path (algorithm.rs:176)
==41341== by 0x11BF6B: minimal_lexical::parse::parse_float (parse.rs:80)
==41341== by 0x11657A: serde_json::de::Deserializer<R>::f64_from_parts (de.rs:752)
==41341== Address 0x4ab8430 is 0 bytes after a block of size 160 alloc'd
==41341== at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==41341== by 0x13790B: alloc::alloc::alloc (alloc.rs:80)
==41341== by 0x137E23: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
==41341== by 0x133E50: alloc::raw_vec::finish_grow (raw_vec.rs:514)
==41341== by 0x1348DA: alloc::raw_vec::RawVec<T,A>::grow_amortized (raw_vec.rs:440)
==41341== by 0x134346: alloc::raw_vec::RawVec<T,A>::try_reserve (raw_vec.rs:281)
==41341== by 0x134B22: alloc::raw_vec::RawVec<T,A>::reserve (raw_vec.rs:267)
==41341== by 0x130998: alloc::vec::Vec<T>::reserve (vec.rs:505)
==41341== by 0x1387B3: minimal_lexical::math::reserve (math.rs:336)
==41341== by 0x1323E0: <minimal_lexical::bignum::Bigint as core::default::Default>::default (bignum.rs:16)
==41341== by 0x12CB30: minimal_lexical::math::Math::from_u64 (math.rs:991)
==41341== by 0x12752D: minimal_lexical::bhcomp::small_atof (bhcomp.rs:145)
```
I am going ahead and merging this since it's pretty clear that an accurate float parser is worth having. I think the remaining work will be looking into whether we can recover some of the performance and then making a call as to whether this should be behind a feature and whether that feature should be on or off by default.
Thanks all!
@dtolnay Thanks! Should we create a new issue to dig into the …
Thanks for catching this. I just fuzzed minimal-lexical and this appeared as well, which has been patched in lexical upstream (by replacing the …).

@rw This should be patched extensively in minimal-lexical, since the only unsafe usage should have been that algorithm. Still, a great reason to fuzz libraries and give a second eye, especially when porting code. This would have been a very serious issue, so thank you.
This is a new attempt at fixing issue #536 using the latest version of @Alexhuszagh's `minimal-lexical` library for deserialization of f64 values. The goal is to make float roundtrip work without severely impacting compile times.

I have:

- … `quickcheck` … but didn't add that here as it would bring in a new dev-dependency)
- … `perf` (see table below)

Note that:

- the `minimal-lexical` crate itself does not have any dependencies
- `minimal-lexical` will use a heap-allocated `Vec` for its internal, intermediate buffer instead of a stack allocated one (which would bring in the `arrayvec` dependency)
- there's also the option of enabling the `arrayvec` feature or writing a custom stack allocated `Vec` that doesn't bring any extra dependencies -- though the heap-allocated `Vec` doesn't seem to degrade performance as measured by json-benchmark
doesn't seem to degrade performance as measured by json-benchmarkcompile time changes
(compiler used: 1.43.1)
I have measured
perf stat cargo $command
on theserde-json
crate after acargo clean
for these values of$command
:check
,build
,build --release
and a faster version ofbuild --release
. The table reports the changes ininstructions:u
both in absolute value and as a percentage. The table also includes the initial and the change in wall time, to get a more "human" feel for the change. I ran perf a few times for each command; the values reported here are median values.check
build
b --release
b --release
The faster `build --release` consists of adding the profile override shown below to compile all build scripts and proc macros without optimizations.

(off-topic: Your mileage may vary though with this profile trick. I know that un-optimized `bindgen` runs very slowly, so this actually backfires when you have that in your dependency graph (or at least that was the case many months ago; I haven't used `bindgen` with embedded code in a long time))

runtime perf
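(Note: the profile override snippet referenced in the compile-time section was elided from this page; a plausible reconstruction using Cargo's `build-override` table, which applies to build scripts and proc macros, is shown here as an assumption rather than the PR's exact snippet.)

```toml
# Compile build scripts and proc macros without optimizations
# to shave time off release builds.
[profile.release.build-override]
opt-level = 0
```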
I ran `json-benchmark` with this branch and master and got these `parse` results. Again, all values are median values out of a few runs.

These values fluctuated quite a bit, so I wouldn't say this PR improves or degrades performance.
@dtolnay do these changes in compile time, in exchange for the round-trip correctness, seem like an acceptable trade-off to you?