-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement HashBag::difference(&self, other: &HashBag)
#6
Conversation
This corresponds to [`HashSet::difference()`](https://doc.rust-lang.org/std/collections/hash_set/struct.HashSet.html#method.difference), and has the same semantics, and is implemented in a similar manner. This has the following impact on the public API of `hashbag`: ``` % cargo install cargo-public-api % cargo public-api --diff-git-checkouts origin/master implement-difference Removed items from the public API ================================= (none) Changed items in the public API =============================== (none) Added items to the public API ============================= +pub fn hashbag::Difference::fmt(&self, fmt: &mut fmt::Formatter<'_>) -> fmt::Result +pub fn hashbag::Difference::next(&mut self) -> Option<&'a T> +pub fn hashbag::Difference::size_hint(&self) -> (usize, Option<usize>) +pub fn hashbag::HashBag::difference<'a>(&'a self, other: &'a HashBag<T, S>) -> Difference<'a, T, S> +pub struct hashbag::Difference<'a, T, S> +pub type hashbag::Difference::Item = &'a T ```
Codecov Report
|
5b11234
to
e70be47
Compare
Nice setup with code coverage reporting 🤩 It seems to be a bit unreliable though, because the two lines in my diff it reports as not covered clearly are covered. So as far as I can tell this PR diff is now at 100% code coverage. |
Oh, don't worry about the coverage — all good if the measurement is a little off! I wonder whether we want |
Thanks for taking a look! Hmm not sure I follow completely. Are you talking about |
Wait I think I get what you mean now. You mean that the iterator maybe should produce a pair of both an entry and an occurrence count, right? For my use case the behavior I implement is what I need. But perhaps both are needed. I would not be against some other contributor coming along later and adding the other variant. But personally I don't need it (at least not for now). |
Reasoning further: Should I call this method something else? I think not, because I again invoke the rule of least surprise: it makes sense for it to have the same signature and semantics as its libstd counterpart. |
... but I am of course happy to make any changes to this PR that you insist on 🙂 |
And add regression test called `no_off_by_one_diff_skewing()` that the old algo does not pass. This commit relies on a pending contribution to `hashbag`: jonhoo/hashbag#6 To not be blocked on that contribution, I ported that code into our own repo. We can remove that code when/if that contribution is made available in a new `hashbag` release. Closes #50
My thinking was that you can pretty easily remove the count from a difference iterator that produces both the items and the difference in count, so it makes sense to just have that one rather than to have two copies of every single set operation method. |
But why would people discard the occurrence count when using a But I can try it out and show you the result and you can see what you think. But before I try it out I would like to ask you if you want In other words, I propose that the following example code that I added: use hashbag::HashBag;
let a: HashBag<_> = [1, 2, 3, 3].iter().cloned().collect();
let b: HashBag<_> = [2, 3].iter().cloned().collect();
let expected: HashBag<_> = [1, 3].iter().cloned().collect();
let actual: HashBag<_> = a.difference(&b).cloned().collect();
assert_eq!(expected, actual); is changed to use hashbag::HashBag;
let a: HashBag<_> = [1, 2, 3, 3].iter().cloned().collect();
let b: HashBag<_> = [2, 3].iter().cloned().collect();
let expected: HashBag<_> = [1, 3].iter().cloned().collect();
let actual: HashBag<_> = a.difference(&b);
assert_eq!(expected, actual); Having to work with an iterator over Looking forward to your response. Thank you for your patience. |
That's exactly why I think
What makes you say that? I mean, it is less convenient than just a
I'd like to avoid that since it forces an unnecessary allocation on the user. Unless, of course, we need to allocate a |
bf114d3
to
4b9fe77
Compare
I played around a bit more with |
I have pushed a new commit now that takes care of the existing comments (but please see my follow-up question about the docs). Looking forward to a new review round. |
Hey everybody, I just wanted to say that actually multiset difference has a well accepted definition as given on this wikipedia page: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations. Basically, In formulas, Hence I would personally stick to the Edit: multisets in Python: from collections import Counter
a = Counter([1,1,2])
b = Counter([1,2,2])
# multiset difference
a - b
# => Counter({1: 1})
# multiset subtraction
a.subtract(b)
a
# => Counter({1: 1, 2: -1}) |
@FedericoStra That's interesting — I wonder why those Stack* posts then seemed to indicate that there was some ambiguity around what |
93d05d5
to
03221ad
Compare
I have pushed another update, and I think all comments have been addressed now. Looking forward to another review round :) |
As another remark, let me point out that if one desires pub fn difference<'a>(&'a self, other: &'a HashBag<T, S>) -> impl Iterator<Item = (&'a T, usize)> + std::iter::FusedIterator {
self.items.iter().filter_map(move |(x, &self_count)| {
let other_count = other.contains(x);
if self_count > other_count {
Some((x, self_count - other_count))
} else {
None
}
})
} I don't want to interfere with the choice, but simply make sure that this possible alternative is on the radar. :) |
Nice 🤩 This substantially simplifies the implementation and passes all tests, including the |
Jonhoo wrote:
There is this little tidbit, about that definition in Remark 2
And from the conclusion of https://arxiv.org/abs/2110.12902
|
Lengthy and slightly unrelated comment regarding the two articles linked by @ratmice
If you read at the very beginning of Section 2.1, the authors rightfully make a clear distinction between multisets, whose codomain is ℕ, and a signed multiset, whose codomain is ℤ. If one is interested in defining the difference of two multisets as a multiset, there isn't much leeway left to get a reasonable definition. One may then decide to define also a
He still defines the difference in the correct way in formula (24). Also, reading Section 7 gives the impression that the author just discovered that he can compute the sum, difference, max and min of two functions... Just look at the beginning of page 8, where he says:
Yes of course it's possible, and has always been. Apart from the non-standard usage of ∩ and ∪ in place of the common ∧ and ∨ to indicate min and max respectively, what's new? This is something people have been doing for millennia. In fact, the whole rambling about "mfunctions" is quite empty: he is just treating "msets" as ordinary real valued function (which they are, although he does not write it anywhere!), nothing more. He seems to be unaware that what he thinks he is inventing is pretty much the concept of function. I would urge you to find a single mathematician who says that this text on page 6 (with no accompanying formula) has any meaning whatsoever:
|
@FedericoStra fwiw, the portion of Jon's comment I meant to reply to instead of what I actually quoted was:
And Jon had mentioned signed values as well, and feel regardless of the quality they at least explain some of the ambiguity, as it is not really an unreasonable assumption that a function like difference follow same laws as the set theoretic difference function... |
I like returning Someone separately made the argument to me that we should really provide an impl HashBag<T> {
pub fn outer_join<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize, usize)> {
self.items.iter().map(|(x, &self_count)| {
(x, self_count, other.contains(x))
}).chain(other.iter().map(|(x, &other_count)| {
(self.contains(x) == 0).then_some((x, self_count, other_count))
}))
}
pub fn difference<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
self.outer_join(other).take_while(|_, sc, _| sc > 0).filter_map(|k, sc, oc| (sc > oc).then_some((k, sc - oc)))
}
pub fn not_in<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
self.outer_join(other).take_while(|_, sc, _| sc > 0).filter_map(|k, _, oc| (oc == 0).then_some((k, sc)))
}
pub fn signed_difference<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, isize)> {
self.outer_join(other).map(|k, sc, oc| (k, sc as isize - oc as isize))
}
} (the Don't feel like you have to make that change in this PR though! |
Instead of a struct. Co-authored-by: Federico Stra <[email protected]>
Great! Commit pushed. |
Published in 0.1.6 🎉 If you (or anyone else) wants to take a stab at |
I just tried it out in #8 :) |
I think for better performance one should use In your `impl HashBag` you can add the following:
impl<T> HashBag<T> {
pub fn sum<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
self.outer_join(other).map(|k, sc, oc| (k, sc + oc))
}
pub fn union<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
self.outer_join(other).map(|k, sc, oc| (k, sc.max(oc)))
}
} |
@FedericoStra What makes you say it'd be better? You'd still need to look up in |
@jonhoo If I'm not mistaken, with pub fn outer_join<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize, usize)> {
self.items.iter().map(|(x, &self_count)| {
(x, self_count, other.contains(x))
}).chain(other.iter().map(|(x, &other_count)| {
(self.contains(x) == 0).then_some((x, self_count, other_count))
}))
} one iterates also over items which are contained exclusively in |
That's why the
|
Yes you are right, for some reasons I misread the But still, we know a priori that |
Mostly to avoid duplicating the logic for doing the lookup into |
This corresponds to
HashSet::difference()
, and has the same semantics, and is implemented in a similar manner.This PR has the following impact on the public API of
hashbag
: