Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HashBag::difference(&self, other: &HashBag) #6

Merged
merged 9 commits into from
Jun 22, 2022

Conversation

Enselic
Copy link
Contributor

@Enselic Enselic commented Jun 5, 2022

This corresponds to HashSet::difference(), and has the same semantics, and is implemented in a similar manner.

This PR has the following impact on the public API of hashbag:

% cargo install cargo-public-api
% cargo public-api --diff-git-checkouts origin/master implement-difference
Removed items from the public API
=================================
(none)

Changed items in the public API
===============================
(none)

Added items to the public API
=============================
+pub fn hashbag::Difference::fmt(&self, fmt: &mut fmt::Formatter<'_>) -> fmt::Result
+pub fn hashbag::Difference::next(&mut self) -> Option<&'a T>
+pub fn hashbag::Difference::size_hint(&self) -> (usize, Option<usize>)
+pub fn hashbag::HashBag::difference<'a>(&'a self, other: &'a HashBag<T, S>) -> Difference<'a, T, S>
+pub struct hashbag::Difference<'a, T, S>
+pub type hashbag::Difference::Item = &'a T

This corresponds to
[`HashSet::difference()`](https://doc.rust-lang.org/std/collections/hash_set/struct.HashSet.html#method.difference),
and has the same semantics, and is implemented in a similar manner.

This has the following impact on the public API of `hashbag`:

```
% cargo install cargo-public-api
% cargo public-api --diff-git-checkouts origin/master implement-difference
Removed items from the public API
=================================
(none)

Changed items in the public API
===============================
(none)

Added items to the public API
=============================
+pub fn hashbag::Difference::fmt(&self, fmt: &mut fmt::Formatter<'_>) -> fmt::Result
+pub fn hashbag::Difference::next(&mut self) -> Option<&'a T>
+pub fn hashbag::Difference::size_hint(&self) -> (usize, Option<usize>)
+pub fn hashbag::HashBag::difference<'a>(&'a self, other: &'a HashBag<T, S>) -> Difference<'a, T, S>
+pub struct hashbag::Difference<'a, T, S>
+pub type hashbag::Difference::Item = &'a T
```
@codecov
Copy link

codecov bot commented Jun 5, 2022

Codecov Report

Merging #6 (70ff9c1) into master (75e9f0d) will decrease coverage by 0.7%.
The diff coverage is 100.0%.

Impacted Files Coverage Δ
src/lib.rs 92.6% <100.0%> (-2.6%) ⬇️

@Enselic Enselic force-pushed the implement-difference branch from 5b11234 to e70be47 Compare June 5, 2022 08:13
@Enselic
Copy link
Contributor Author

Enselic commented Jun 5, 2022

Nice setup with code coverage reporting 🤩 It seems to be a bit unreliable though, because the two lines in my diff it reports as not covered clearly are covered. So as far as I can tell this PR diff is now at 100% code coverage.

@jonhoo
Copy link
Owner

jonhoo commented Jun 5, 2022

Oh, don't worry about the coverage — all good if the measurement is a little off!

I wonder whether we want Difference to produce the count of the difference and not just the set of items in the difference. What do you think?

@Enselic
Copy link
Contributor Author

Enselic commented Jun 5, 2022

Thanks for taking a look! Hmm not sure I follow completely. Are you talking about size_hint()? Its implementation is the same as in libstd, and I think it makes sense to behave the same, not least to follow the rule of least surprise: https://doc.rust-lang.org/src/std/collections/hash/set.rs.html#1670-1673

@Enselic
Copy link
Contributor Author

Enselic commented Jun 5, 2022

Wait I think I get what you mean now. You mean that the iterator maybe should produce a pair of both an entry and an occurrence count, right?

For my use case the behavior I implement is what I need. But perhaps both are needed. I would not be against some other contributor coming along later and adding the other variant. But personally I don't need it (at least not for now).

@Enselic
Copy link
Contributor Author

Enselic commented Jun 5, 2022

Reasoning further: Should I call this method something else? I think not, because I again invoke the rule of least surprise: it makes sense for it to have the same signature and semantics as its libstd counterpart.

@Enselic
Copy link
Contributor Author

Enselic commented Jun 6, 2022

... but I am of course happy to make any changes to this PR that you insist on 🙂

Enselic added a commit to cargo-public-api/cargo-public-api that referenced this pull request Jun 6, 2022
And add regression test called `no_off_by_one_diff_skewing()` that the
old algo does not pass.

This commit relies on a pending contribution to `hashbag`:
jonhoo/hashbag#6

To not be blocked on that contribution, I ported that code into our own
repo. We can remove that code when/if that contribution is made
available in a new `hashbag` release.

Closes #50
@jonhoo
Copy link
Owner

jonhoo commented Jun 7, 2022

My thinking was that you can pretty easily remove the count from a difference iterator that produces both the items and the difference in count, so it makes sense to just have that one rather than to have two copies of every single set operation method.

@Enselic
Copy link
Contributor Author

Enselic commented Jun 7, 2022

But why would people discard the occurrence count when using a HashBag? The point of using a HashBag is that the occurrence count is taken into account. If the occurrence count is not of interested, a HashSet can be used instead.

But I can try it out and show you the result and you can see what you think. But before I try it out I would like to ask you if you want difference() to still return an iterator in that case? An iterator over (&'a T, usize) is pretty inconvenient to work with. So I propose that in that case we make difference() return another HashBag instead. Does that sound OK to you?

In other words, I propose that the following example code that I added:

use hashbag::HashBag;

let a: HashBag<_> = [1, 2, 3, 3].iter().cloned().collect();
let b: HashBag<_> = [2, 3].iter().cloned().collect();
let expected: HashBag<_> = [1, 3].iter().cloned().collect();
let actual: HashBag<_> = a.difference(&b).cloned().collect();
assert_eq!(expected, actual);

is changed to

use hashbag::HashBag;

let a: HashBag<_> = [1, 2, 3, 3].iter().cloned().collect();
let b: HashBag<_> = [2, 3].iter().cloned().collect();
let expected: HashBag<_> = [1, 3].iter().cloned().collect();
let actual: HashBag<_> = a.difference(&b);
assert_eq!(expected, actual);

Having to work with an iterator over (&'a T, usize) from difference() in that example code becomes real messy, which to me is an indication of that the API is not ideal. The downside would be that then we require entries to implement Clone (contrary to vanilla HashSet::difference()), because limiting the lifetime of the resulting HashBag to the lifetime of the original lifetime would impose another unexpected and inconvenient limitation IMHO.

Looking forward to your response. Thank you for your patience.

@jonhoo
Copy link
Owner

jonhoo commented Jun 12, 2022

But why would people discard the occurrence count when using a HashBag? The point of using a HashBag is that the occurrence count is taken into account. If the occurrence count is not of interested, a HashSet can be used instead.

That's exactly why I think HashBag::difference should return an Iterator<Item = (&T, isize)> :)

An iterator over (&'a T, usize) is pretty inconvenient to work with.

What makes you say that? I mean, it is less convenient than just a &T, but it's also carrying more information, which the user explicitly wanted since they chose to use a HashBag over a HashSet in the first place (as you argued above). And turning Iterator<Item = (&T, isize)> into Iterator<Item = &T> is pretty simple: .map(|(t, _)| t) is all you need. It's true that that makes the particular example code you give a bit more verbose, but I don't think that indicates that real-world use-cases where you do need the occurrence difference less ergonomic.

So I propose that in that case we make difference() return another HashBag instead. Does that sound OK to you?

I'd like to avoid that since it forces an unnecessary allocation on the user. Unless, of course, we need to allocate a HashBag internally in order to compute the difference in the first place anyway, in which case returning one doesn't seem so bad.

@Enselic Enselic force-pushed the implement-difference branch from bf114d3 to 4b9fe77 Compare June 12, 2022 19:06
@Enselic
Copy link
Contributor Author

Enselic commented Jun 12, 2022

I played around a bit more with (&'a T, usize) and it indeed wasn't too bad. Let's go for that then. I pushed a commit now to make it so.

@Enselic
Copy link
Contributor Author

Enselic commented Jun 19, 2022

I have pushed a new commit now that takes care of the existing comments (but please see my follow-up question about the docs). Looking forward to a new review round.

@FedericoStra
Copy link
Contributor

FedericoStra commented Jun 19, 2022

Hey everybody, I just wanted to say that actually multiset difference has a well accepted definition as given on this wikipedia page: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations. Basically, self.difference(other) should produce all the items whose count inside self is strictly larger than in other, and the resulting count should be the difference between the two counts.

In formulas, x is an element of self.difference(other) iff self.contains(x) > other.contains(x) and in such case its count in the difference is self.contains(x) - other.contains(x).

Hence I would personally stick to the difference name instead of subtract, as it is already generally accepted. subtract could then be used for returning also the items with "negative count".

Edit: multisets in Python:

from collections import Counter

a = Counter([1,1,2])
b = Counter([1,2,2])

# multiset difference
a - b
# => Counter({1: 1})

# multiset subtraction
a.subtract(b)
a
# => Counter({1: 1, 2: -1})

@jonhoo
Copy link
Owner

jonhoo commented Jun 19, 2022

@FedericoStra That's interesting — I wonder why those Stack* posts then seemed to indicate that there was some ambiguity around what difference was actually defined as. Given that it seems pretty settled from the Wikipedia page though, I'm happy to re-use the proper terminology and go back to difference (@Enselic — sorry for the back-and-forth :) ).

@Enselic Enselic force-pushed the implement-difference branch from 93d05d5 to 03221ad Compare June 20, 2022 05:40
@Enselic
Copy link
Contributor Author

Enselic commented Jun 20, 2022

I have pushed another update, and I think all comments have been addressed now. Looking forward to another review round :)

@FedericoStra
Copy link
Contributor

FedericoStra commented Jun 20, 2022

As another remark, let me point out that if one desires difference can be succinctly implemented with a filter_map, returning an impl Iterator instead of a custom struct. Of course this does not allow implementing Debug on the returned type unfortunately, but might be deemed preferable in terms of conciseness and maintainability.

pub fn difference<'a>(&'a self, other: &'a HashBag<T, S>) -> impl Iterator<Item = (&'a T, usize)> + std::iter::FusedIterator {
    self.items.iter().filter_map(move |(x, &self_count)| {
        let other_count = other.contains(x);
        if self_count > other_count {
            Some((x, self_count - other_count))
        } else {
            None
        }
    })
}

I don't want to interfere with the choice, but simply make sure that this possible alternative is on the radar. :)

@Enselic
Copy link
Contributor Author

Enselic commented Jun 20, 2022

As another remark, let me point out that if one desires difference can be succinctly implemented with a filter_map, returning an impl Iterator instead of a custom struct. Of course this does not allow implementing Debug on the returned type unfortunately, but might be deemed preferable in terms of conciseness and maintainability.

Nice 🤩 This substantially simplifies the implementation and passes all tests, including the size_hint() test. See Enselic@e441a03. And who needs it to be Debug anyway? But I'll let Jon have the final say of course.

@ratmice
Copy link

ratmice commented Jun 20, 2022

Jonhoo wrote:

Given that it seems pretty settled from the Wikipedia page though, I'm happy to re-use the proper terminology and go back to difference (@Enselic — sorry for the back-and-forth :) ).

There is this little tidbit, about that definition in Remark 2
Complementation in multisets, D. Singh

Accordingly, some of the consequences of the aforesaid definition turn out to be disturbing. For example, if ... contradiciting the classical laws that ... therefore ... is only a lattice not a boolean algebra.

And from the conclusion of https://arxiv.org/abs/2110.12902

by allowing the subtraction of msets to take negative values. This paved the way for recovering several properties analogous to traditional sets involving the complement operation, including the De Morgan theorem.

@FedericoStra
Copy link
Contributor

FedericoStra commented Jun 20, 2022

Lengthy and slightly unrelated comment regarding the two articles linked by @ratmice

There is this little tidbit, about that definition in Remark 2 Complementation in multisets, D. Singh

If you read at the very beginning of Section 2.1, the authors rightfully make a clear distinction between multisets, whose codomain is ℕ, and a signed multiset, whose codomain is ℤ. If one is interested in defining the difference of two multisets as a multiset, there isn't much leeway left to get a reasonable definition. One may then decide to define also a signed_difference to get the corresponding signed multiset, but that brings you out of the multiset universe.


And from the conclusion of https://arxiv.org/abs/2110.12902

by allowing the subtraction of msets to take negative values. This paved the way for recovering several properties analogous to traditional sets involving the complement operation, including the De Morgan theorem.

He still defines the difference in the correct way in formula (24).

Also, reading Section 7 gives the impression that the author just discovered that he can compute the sum, difference, max and min of two functions... Just look at the beginning of page 8, where he says:

For instance, it becomes possible to write things such as:
r(x) = (g(x) ∩ h(x)) + g(x)
s(x) = (g(x) + h(x)) ∪ (g(x) − h(x))
t(x) = [g(x) ∩ h(x)] − [g(x) ∪ h(x)]

Yes of course it's possible, and has always been. Apart from the non-standard usage of ∩ and ∪ in place of the common ∧ and ∨ to indicate min and max respectively, what's new? This is something people have been doing for millennia. In fact, the whole rambling about "mfunctions" is quite empty: he is just treating "msets" as ordinary real valued function (which they are, although he does not write it anywhere!), nothing more. He seems to be unaware that what he thinks he is inventing is pretty much the concept of function.

I would urge you to find a single mathematician who says that this text on page 6 (with no accompanying formula) has any meaning whatsoever:

Now, we can approach the case of continuous functions and scalar fields simply by taking the separation between the points along the horizontal axis to the limit of 0. Observe that a homeomorphism is establsiehd between the function space and its representation in the multisets, as provided by the bijective associations between the function abscissa and the elements, therefore preserving the topology of the representation. In addition, all the operations required for the multisets apply irrespectively of the neighborhood of the abscissa neighborhood. Actually, it is hard to think of a bound function that will not map into the respective multiset elements. To any extent, the results in this work are understood to be applicable to mset representations of functions, i.e. mfunctions.

@ratmice
Copy link

ratmice commented Jun 20, 2022

@FedericoStra fwiw, the portion of Jon's comment I meant to reply to instead of what I actually quoted was:

I wonder why those Stack* posts then seemed to indicate that there was some ambiguity around what difference was actually defined as.

And Jon had mentioned signed values as well, and feel regardless of the quality they at least explain some of the ambiguity, as it is not really an unreasonable assumption that a function like difference follow same laws as the set theoretic difference function...

@jonhoo
Copy link
Owner

jonhoo commented Jun 22, 2022

But I'll let Jon have the final say of course.

I like returning impl Iterator here — and I agree the Debug implementation probably isn't super relevant on the iterator here anyway!

Someone separately made the argument to me that we should really provide an outer_join method for HashBags, because it'd allow implementing all these other things in terms of those. For example:

impl HashBag<T> {
    pub fn outer_join<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize, usize)> {
        self.items.iter().map(|(x, &self_count)| {
            (x, self_count, other.contains(x))
        }).chain(other.iter().map(|(x, &other_count)| {
            (self.contains(x) == 0).then_some((x, self_count, other_count))
        }))
    }
    pub fn difference<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).take_while(|_, sc, _| sc > 0).filter_map(|k, sc, oc| (sc > oc).then_some((k, sc - oc)))
    }
    pub fn not_in<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).take_while(|_, sc, _| sc > 0).filter_map(|k, _, oc| (oc == 0).then_some((k, sc)))
    }
    pub fn signed_difference<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, isize)> {
        self.outer_join(other).map(|k, sc, oc| (k, sc as isize - oc as isize))
    }
}

(the take_while is needed so that we don't unnecessarily iterate over items that are only in other when the semantics of the operations do not require it)

Don't feel like you have to make that change in this PR though!

@Enselic
Copy link
Contributor Author

Enselic commented Jun 22, 2022

I like returning impl Iterator here — and I agree the Debug implementation probably isn't super relevant on the iterator here anyway!

Great! Commit pushed.

@jonhoo jonhoo merged commit 53ff43a into jonhoo:master Jun 22, 2022
@jonhoo
Copy link
Owner

jonhoo commented Jun 22, 2022

Published in 0.1.6 🎉

If you (or anyone else) wants to take a stab at outer_join and friends, please feel free!

@marcelogomez
Copy link
Contributor

Published in 0.1.6 🎉

If you (or anyone else) wants to take a stab at outer_join and friends, please feel free!

I just tried it out in #8 :)

@FedericoStra
Copy link
Contributor

@jonhoo

Someone separately made the argument to me that we should really provide an outer_join method for HashBags, because it'd allow implementing all these other things in terms of those.

(the take_while is needed so that we don't unnecessarily iterate over items that are only in other when the semantics of the operations do not require it)

I think for better performance one should use outer_join only for cases where items from both hashbags are needed (i.e. signed difference, sum, union...), whereas a simple iter is better suited for cases where the resulting hashbag is always a subset of self (difference, not_in).

In your `impl HashBag` you can add the following:

sum and union, according to https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations:

impl<T> HashBag<T> {
    pub fn sum<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).map(|k, sc, oc| (k, sc + oc))
    }
    pub fn union<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).map(|k, sc, oc| (k, sc.max(oc)))
    }
}

@jonhoo
Copy link
Owner

jonhoo commented Jun 27, 2022

@FedericoStra What makes you say it'd be better? You'd still need to look up in other for each item, which the outer_join iterator will do for you.

@FedericoStra
Copy link
Contributor

@jonhoo If I'm not mistaken, with outer_join

pub fn outer_join<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize, usize)> {
    self.items.iter().map(|(x, &self_count)| {
        (x, self_count, other.contains(x))
    }).chain(other.iter().map(|(x, &other_count)| {
        (self.contains(x) == 0).then_some((x, self_count, other_count))
    }))
}

one iterates also over items which are contained exclusively in other (this is what chain(...) does), which is unnecessary when computing difference, intersection and not_in. When the support of the resulting multiset is a subset of the support of self, there is no need for outer_join, one can just iterate over the elements of self alone. Right?

@jonhoo
Copy link
Owner

jonhoo commented Jun 27, 2022

That's why the take_while is there in the things that don't need anything that's exclusively in other:

(the take_while is needed so that we don't unnecessarily iterate over items that are only in other when the semantics of the operations do not require it)

@FedericoStra
Copy link
Contributor

Yes you are right, for some reasons I misread the take_while as a filter, sorry.

But still, we know a priori that chain is unnecessary, but to "eliminate" it every item in self has to be repeatedly tested with |_, sc, _| sc > 0 to discover the end of the first iterator. Why not just use a plain simple self.items.iter()?

@jonhoo
Copy link
Owner

jonhoo commented Jun 28, 2022

Mostly to avoid duplicating the logic for doing the lookup into other. Also remember that the take_while optimizes to != 0 since it's unsigned anyway, and the .iter() on other and then consuming exactly one element is likely going to be very cheap. One thing we could do is have a private outer_join_self that just iterates over self with the lookup in other, and then have outer_join just do the chain. That way the methods that don't need to iterate over other can just call outer_join_self and skip the take_while. Happy to take a PR to that effect!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants