-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: impl canonicalize_into for SparseArray #2420
base: develop
Are you sure you want to change the base?
Conversation
a10y
commented
Feb 19, 2025
•
edited
Loading
edited
encodings/sparse/src/canonical.rs
Outdated
canonicalize_primitive_into(&array, builder)?; | ||
}); | ||
} | ||
_ => unreachable!("unsupported SparseArray dtype {}", array.dtype()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably error here instead of panic, just because there's no compile-time safety that Sparse arrays don't one day start supporting other DTypes. I don't see why they couldn't, e.g. support ExtDType?
Could even fall back to builder.extend(array.into_canonical())
.fill_scalar() | ||
.as_primitive() | ||
.typed_value() | ||
.vortex_expect("fill value"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is optional when the fill value is null. I don't think you can expect this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are correct 🤦
.fill(MaybeUninit::new(fill_value)); | ||
// SAFETY: we just filled the buffer with the fill value. | ||
unsafe { | ||
builder.values.set_len(sparse.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit, but putting the semi-colon after the }
keeps it on one line.
.vortex_expect("fill value"); | ||
|
||
builder | ||
.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd to me that this is public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is a bit odd, I think Dan exposed it as part of his bitpacking canonicalize_into. I can change PrimitiveBuilder to expose a spare_capacity_mut instead
} | ||
}); | ||
|
||
// Set the validity from the sparse array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a shame after specifically patching values, to then canonicalize the validity into a BooleanBuffer when there may only be a couple of nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Primarily had a peek at the benchmarks. Looks good, some tiny style suggestions.
@@ -28,5 +28,11 @@ vortex-mask = { workspace = true } | |||
vortex-scalar = { workspace = true } | |||
|
|||
[dev-dependencies] | |||
divan = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
use vortex_scalar::Scalar; | ||
use vortex_sparse::SparseArray; | ||
|
||
fn generate_sparse_array(len: usize, sparsity: f64) -> SparseArray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny style nits:
- could consider going ordering functions by call order (been told by @gatesn that's how we do it :)
=> main first, then the benchmarks, then the helper. I primarily care about the benchmarks being above the helpers - factoring out shared bench parametrization into
const BENCH_ARGS
, e.g. seeencodings/fsst/benches/fsst_compress.rs
Not really crucial, more like to get the benchmarks into similar shape and easily human parseable due to consistency.
fn into_canonical(bencher: Bencher, sparsity: f64) { | ||
bencher | ||
.with_inputs(|| generate_sparse_array(64_000, sparsity)) | ||
.bench_local_values(|sparse_array| sparse_array.into_canonical().unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious on your take about bench_local_values
vs bench_values
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was sort of an arbitrary choice, I think it's probably fine to parallelize so can switch to bench_values
match_each_integer_ptype!(indices.ptype(), |$I| { | ||
let indices = indices.as_slice::<$I>(); | ||
for (&index, value) in indices.iter().zip(values.boolean_buffer().into_iter()) { | ||
builder.inner.set_bit(index as usize, value); | ||
} | ||
}); | ||
|
||
// Set the validity from the sparse array. | ||
builder.nulls.append_validity_mask(sparse.validity_mask()?); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move this into the builder?
builder | ||
.values | ||
.spare_capacity_mut() | ||
.fill(MaybeUninit::new(fill_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this not fill past the arr len? Is that performance or correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, I can just trim this to [..len]
and that's probably better
Did sparse show up as slow to canonicalize? |
Moving to draft to clean things up. @joseph-isaacs We use Sparse as a standalone compressor in vortex-btrblocks so I'd expect it to be more frequent in our encoding trees |
4ef8d81
to
ad9eb84
Compare
CodSpeed Performance ReportMerging #2420 will degrade performances by 23.33%Comparing Summary
Benchmarks breakdown
|