Implement `HashBag::difference(&self, other: &HashBag)` #6

Enselic · 2022-06-05T07:35:51Z

This corresponds to HashSet::difference(), and has the same semantics, and is implemented in a similar manner.

This PR has the following impact on the public API of hashbag:

% cargo install cargo-public-api
% cargo public-api --diff-git-checkouts origin/master implement-difference
Removed items from the public API
=================================
(none)

Changed items in the public API
===============================
(none)

Added items to the public API
=============================
+pub fn hashbag::Difference::fmt(&self, fmt: &mut fmt::Formatter<'_>) -> fmt::Result
+pub fn hashbag::Difference::next(&mut self) -> Option<&'a T>
+pub fn hashbag::Difference::size_hint(&self) -> (usize, Option<usize>)
+pub fn hashbag::HashBag::difference<'a>(&'a self, other: &'a HashBag<T, S>) -> Difference<'a, T, S>
+pub struct hashbag::Difference<'a, T, S>
+pub type hashbag::Difference::Item = &'a T

This corresponds to [`HashSet::difference()`](https://doc.rust-lang.org/std/collections/hash_set/struct.HashSet.html#method.difference), and has the same semantics, and is implemented in a similar manner. This has the following impact on the public API of `hashbag`: ``` % cargo install cargo-public-api % cargo public-api --diff-git-checkouts origin/master implement-difference Removed items from the public API ================================= (none) Changed items in the public API =============================== (none) Added items to the public API ============================= +pub fn hashbag::Difference::fmt(&self, fmt: &mut fmt::Formatter<'_>) -> fmt::Result +pub fn hashbag::Difference::next(&mut self) -> Option<&'a T> +pub fn hashbag::Difference::size_hint(&self) -> (usize, Option<usize>) +pub fn hashbag::HashBag::difference<'a>(&'a self, other: &'a HashBag<T, S>) -> Difference<'a, T, S> +pub struct hashbag::Difference<'a, T, S> +pub type hashbag::Difference::Item = &'a T ```

codecov · 2022-06-05T07:38:45Z

Codecov Report

Merging #6 (70ff9c1) into master (75e9f0d) will decrease coverage by 0.7%.
The diff coverage is 100.0%.

Impacted Files	Coverage Δ
src/lib.rs	`92.6% <100.0%> (-2.6%)`	⬇️

Enselic · 2022-06-05T08:50:27Z

Nice setup with code coverage reporting 🤩 It seems to be a bit unreliable though, because the two lines in my diff it reports as not covered clearly are covered. So as far as I can tell this PR diff is now at 100% code coverage.

jonhoo · 2022-06-05T17:33:29Z

Oh, don't worry about the coverage — all good if the measurement is a little off!

I wonder whether we want Difference to produce the count of the difference and not just the set of items in the difference. What do you think?

Enselic · 2022-06-05T17:53:20Z

Thanks for taking a look! Hmm not sure I follow completely. Are you talking about size_hint()? Its implementation is the same as in libstd, and I think it makes sense to behave the same, not least to follow the rule of least surprise: https://doc.rust-lang.org/src/std/collections/hash/set.rs.html#1670-1673

Enselic · 2022-06-05T17:57:26Z

Wait I think I get what you mean now. You mean that the iterator maybe should produce a pair of both an entry and an occurrence count, right?

For my use case the behavior I implement is what I need. But perhaps both are needed. I would not be against some other contributor coming along later and adding the other variant. But personally I don't need it (at least not for now).

Enselic · 2022-06-05T18:19:10Z

Reasoning further: Should I call this method something else? I think not, because I again invoke the rule of least surprise: it makes sense for it to have the same signature and semantics as its libstd counterpart.

Enselic · 2022-06-06T06:06:00Z

... but I am of course happy to make any changes to this PR that you insist on 🙂

And add regression test called `no_off_by_one_diff_skewing()` that the old algo does not pass. This commit relies on a pending contribution to `hashbag`: jonhoo/hashbag#6 To not be blocked on that contribution, I ported that code into our own repo. We can remove that code when/if that contribution is made available in a new `hashbag` release. Closes #50

jonhoo · 2022-06-07T15:41:14Z

My thinking was that you can pretty easily remove the count from a difference iterator that produces both the items and the difference in count, so it makes sense to just have that one rather than to have two copies of every single set operation method.

Enselic · 2022-06-07T16:58:13Z

But why would people discard the occurrence count when using a HashBag? The point of using a HashBag is that the occurrence count is taken into account. If the occurrence count is not of interested, a HashSet can be used instead.

But I can try it out and show you the result and you can see what you think. But before I try it out I would like to ask you if you want difference() to still return an iterator in that case? An iterator over (&'a T, usize) is pretty inconvenient to work with. So I propose that in that case we make difference() return another HashBag instead. Does that sound OK to you?

In other words, I propose that the following example code that I added:

use hashbag::HashBag;

let a: HashBag<_> = [1, 2, 3, 3].iter().cloned().collect();
let b: HashBag<_> = [2, 3].iter().cloned().collect();
let expected: HashBag<_> = [1, 3].iter().cloned().collect();
let actual: HashBag<_> = a.difference(&b).cloned().collect();
assert_eq!(expected, actual);

is changed to

use hashbag::HashBag;

let a: HashBag<_> = [1, 2, 3, 3].iter().cloned().collect();
let b: HashBag<_> = [2, 3].iter().cloned().collect();
let expected: HashBag<_> = [1, 3].iter().cloned().collect();
let actual: HashBag<_> = a.difference(&b);
assert_eq!(expected, actual);

Having to work with an iterator over (&'a T, usize) from difference() in that example code becomes real messy, which to me is an indication of that the API is not ideal. The downside would be that then we require entries to implement Clone (contrary to vanilla HashSet::difference()), because limiting the lifetime of the resulting HashBag to the lifetime of the original lifetime would impose another unexpected and inconvenient limitation IMHO.

Looking forward to your response. Thank you for your patience.

jonhoo · 2022-06-12T17:16:42Z

But why would people discard the occurrence count when using a HashBag? The point of using a HashBag is that the occurrence count is taken into account. If the occurrence count is not of interested, a HashSet can be used instead.

That's exactly why I think HashBag::difference should return an Iterator<Item = (&T, isize)> :)

An iterator over (&'a T, usize) is pretty inconvenient to work with.

What makes you say that? I mean, it is less convenient than just a &T, but it's also carrying more information, which the user explicitly wanted since they chose to use a HashBag over a HashSet in the first place (as you argued above). And turning Iterator<Item = (&T, isize)> into Iterator<Item = &T> is pretty simple: .map(|(t, _)| t) is all you need. It's true that that makes the particular example code you give a bit more verbose, but I don't think that indicates that real-world use-cases where you do need the occurrence difference less ergonomic.

So I propose that in that case we make difference() return another HashBag instead. Does that sound OK to you?

I'd like to avoid that since it forces an unnecessary allocation on the user. Unless, of course, we need to allocate a HashBag internally in order to compute the difference in the first place anyway, in which case returning one doesn't seem so bad.

Enselic · 2022-06-12T19:12:50Z

I played around a bit more with (&'a T, usize) and it indeed wasn't too bad. Let's go for that then. I pushed a commit now to make it so.

src/lib.rs

Enselic · 2022-06-19T09:13:32Z

I have pushed a new commit now that takes care of the existing comments (but please see my follow-up question about the docs). Looking forward to a new review round.

src/lib.rs

FedericoStra · 2022-06-19T21:18:16Z

Hey everybody, I just wanted to say that actually multiset difference has a well accepted definition as given on this wikipedia page: https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations. Basically, self.difference(other) should produce all the items whose count inside self is strictly larger than in other, and the resulting count should be the difference between the two counts.

In formulas, x is an element of self.difference(other) iff self.contains(x) > other.contains(x) and in such case its count in the difference is self.contains(x) - other.contains(x).

Hence I would personally stick to the difference name instead of subtract, as it is already generally accepted. subtract could then be used for returning also the items with "negative count".

Edit: multisets in Python:

from collections import Counter

a = Counter([1,1,2])
b = Counter([1,2,2])

# multiset difference
a - b
# => Counter({1: 1})

# multiset subtraction
a.subtract(b)
a
# => Counter({1: 1, 2: -1})

jonhoo · 2022-06-19T23:49:09Z

@FedericoStra That's interesting — I wonder why those Stack* posts then seemed to indicate that there was some ambiguity around what difference was actually defined as. Given that it seems pretty settled from the Wikipedia page though, I'm happy to re-use the proper terminology and go back to difference (@Enselic — sorry for the back-and-forth :) ).

src/lib.rs

Enselic · 2022-06-20T06:01:32Z

I have pushed another update, and I think all comments have been addressed now. Looking forward to another review round :)

FedericoStra · 2022-06-20T08:55:53Z

As another remark, let me point out that if one desires difference can be succinctly implemented with a filter_map, returning an impl Iterator instead of a custom struct. Of course this does not allow implementing Debug on the returned type unfortunately, but might be deemed preferable in terms of conciseness and maintainability.

pub fn difference<'a>(&'a self, other: &'a HashBag<T, S>) -> impl Iterator<Item = (&'a T, usize)> + std::iter::FusedIterator {
    self.items.iter().filter_map(move |(x, &self_count)| {
        let other_count = other.contains(x);
        if self_count > other_count {
            Some((x, self_count - other_count))
        } else {
            None
        }
    })
}

I don't want to interfere with the choice, but simply make sure that this possible alternative is on the radar. :)

Enselic · 2022-06-20T09:08:34Z

As another remark, let me point out that if one desires difference can be succinctly implemented with a filter_map, returning an impl Iterator instead of a custom struct. Of course this does not allow implementing Debug on the returned type unfortunately, but might be deemed preferable in terms of conciseness and maintainability.

Nice 🤩 This substantially simplifies the implementation and passes all tests, including the size_hint() test. See Enselic@e441a03. And who needs it to be Debug anyway? But I'll let Jon have the final say of course.

ratmice · 2022-06-20T11:13:28Z

Jonhoo wrote:

Given that it seems pretty settled from the Wikipedia page though, I'm happy to re-use the proper terminology and go back to difference (@Enselic — sorry for the back-and-forth :) ).

There is this little tidbit, about that definition in Remark 2
Complementation in multisets, D. Singh

Accordingly, some of the consequences of the aforesaid definition turn out to be disturbing. For example, if ... contradiciting the classical laws that ... therefore ... is only a lattice not a boolean algebra.

And from the conclusion of https://arxiv.org/abs/2110.12902

by allowing the subtraction of msets to take negative values. This paved the way for recovering several properties analogous to traditional sets involving the complement operation, including the De Morgan theorem.

FedericoStra · 2022-06-20T12:11:59Z

Lengthy and slightly unrelated comment regarding the two articles linked by @ratmice

There is this little tidbit, about that definition in Remark 2 Complementation in multisets, D. Singh

If you read at the very beginning of Section 2.1, the authors rightfully make a clear distinction between multisets, whose codomain is ℕ, and a signed multiset, whose codomain is ℤ. If one is interested in defining the difference of two multisets as a multiset, there isn't much leeway left to get a reasonable definition. One may then decide to define also a signed_difference to get the corresponding signed multiset, but that brings you out of the multiset universe.

And from the conclusion of https://arxiv.org/abs/2110.12902

by allowing the subtraction of msets to take negative values. This paved the way for recovering several properties analogous to traditional sets involving the complement operation, including the De Morgan theorem.

He still defines the difference in the correct way in formula (24).

Also, reading Section 7 gives the impression that the author just discovered that he can compute the sum, difference, max and min of two functions... Just look at the beginning of page 8, where he says:

For instance, it becomes possible to write things such as:
r(x) = (g(x) ∩ h(x)) + g(x)
s(x) = (g(x) + h(x)) ∪ (g(x) − h(x))
t(x) = [g(x) ∩ h(x)] − [g(x) ∪ h(x)]

Yes of course it's possible, and has always been. Apart from the non-standard usage of ∩ and ∪ in place of the common ∧ and ∨ to indicate min and max respectively, what's new? This is something people have been doing for millennia. In fact, the whole rambling about "mfunctions" is quite empty: he is just treating "msets" as ordinary real valued function (which they are, although he does not write it anywhere!), nothing more. He seems to be unaware that what he thinks he is inventing is pretty much the concept of function.

I would urge you to find a single mathematician who says that this text on page 6 (with no accompanying formula) has any meaning whatsoever:

Now, we can approach the case of continuous functions and scalar fields simply by taking the separation between the points along the horizontal axis to the limit of 0. Observe that a homeomorphism is establsiehd between the function space and its representation in the multisets, as provided by the bijective associations between the function abscissa and the elements, therefore preserving the topology of the representation. In addition, all the operations required for the multisets apply irrespectively of the neighborhood of the abscissa neighborhood. Actually, it is hard to think of a bound function that will not map into the respective multiset elements. To any extent, the results in this work are understood to be applicable to mset representations of functions, i.e. mfunctions.

ratmice · 2022-06-20T12:27:48Z

@FedericoStra fwiw, the portion of Jon's comment I meant to reply to instead of what I actually quoted was:

I wonder why those Stack* posts then seemed to indicate that there was some ambiguity around what difference was actually defined as.

And Jon had mentioned signed values as well, and feel regardless of the quality they at least explain some of the ambiguity, as it is not really an unreasonable assumption that a function like difference follow same laws as the set theoretic difference function...

jonhoo · 2022-06-22T03:29:23Z

But I'll let Jon have the final say of course.

I like returning impl Iterator here — and I agree the Debug implementation probably isn't super relevant on the iterator here anyway!

Someone separately made the argument to me that we should really provide an outer_join method for HashBags, because it'd allow implementing all these other things in terms of those. For example:

impl HashBag<T> {
    pub fn outer_join<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize, usize)> {
        self.items.iter().map(|(x, &self_count)| {
            (x, self_count, other.contains(x))
        }).chain(other.iter().map(|(x, &other_count)| {
            (self.contains(x) == 0).then_some((x, self_count, other_count))
        }))
    }
    pub fn difference<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).take_while(|_, sc, _| sc > 0).filter_map(|k, sc, oc| (sc > oc).then_some((k, sc - oc)))
    }
    pub fn not_in<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).take_while(|_, sc, _| sc > 0).filter_map(|k, _, oc| (oc == 0).then_some((k, sc)))
    }
    pub fn signed_difference<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, isize)> {
        self.outer_join(other).map(|k, sc, oc| (k, sc as isize - oc as isize))
    }
}

(the take_while is needed so that we don't unnecessarily iterate over items that are only in other when the semantics of the operations do not require it)

Don't feel like you have to make that change in this PR though!

Instead of a struct. Co-authored-by: Federico Stra <[email protected]>

Enselic · 2022-06-22T05:50:16Z

I like returning impl Iterator here — and I agree the Debug implementation probably isn't super relevant on the iterator here anyway!

Great! Commit pushed.

jonhoo · 2022-06-22T15:29:34Z

Published in 0.1.6 🎉

If you (or anyone else) wants to take a stab at outer_join and friends, please feel free!

marcelogomez · 2022-06-25T17:24:26Z

Published in 0.1.6 🎉

If you (or anyone else) wants to take a stab at outer_join and friends, please feel free!

I just tried it out in #8 :)

FedericoStra · 2022-06-27T08:43:11Z

@jonhoo

Someone separately made the argument to me that we should really provide an outer_join method for HashBags, because it'd allow implementing all these other things in terms of those.

(the take_while is needed so that we don't unnecessarily iterate over items that are only in other when the semantics of the operations do not require it)

I think for better performance one should use outer_join only for cases where items from both hashbags are needed (i.e. signed difference, sum, union...), whereas a simple iter is better suited for cases where the resulting hashbag is always a subset of self (difference, not_in).

In your `impl HashBag` you can add the following:

sum and union, according to https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations:

impl<T> HashBag<T> {
    pub fn sum<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).map(|k, sc, oc| (k, sc + oc))
    }
    pub fn union<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize)> {
        self.outer_join(other).map(|k, sc, oc| (k, sc.max(oc)))
    }
}

jonhoo · 2022-06-27T15:44:47Z

@FedericoStra What makes you say it'd be better? You'd still need to look up in other for each item, which the outer_join iterator will do for you.

FedericoStra · 2022-06-27T16:43:00Z

@jonhoo If I'm not mistaken, with outer_join

pub fn outer_join<'a>(&'a self, other: &'a Self) -> impl Iterator<Item = (&'a T, usize, usize)> {
    self.items.iter().map(|(x, &self_count)| {
        (x, self_count, other.contains(x))
    }).chain(other.iter().map(|(x, &other_count)| {
        (self.contains(x) == 0).then_some((x, self_count, other_count))
    }))
}

one iterates also over items which are contained exclusively in other (this is what chain(...) does), which is unnecessary when computing difference, intersection and not_in. When the support of the resulting multiset is a subset of the support of self, there is no need for outer_join, one can just iterate over the elements of self alone. Right?

jonhoo · 2022-06-27T21:07:19Z

That's why the take_while is there in the things that don't need anything that's exclusively in other:

(the take_while is needed so that we don't unnecessarily iterate over items that are only in other when the semantics of the operations do not require it)

FedericoStra · 2022-06-27T21:16:06Z

Yes you are right, for some reasons I misread the take_while as a filter, sorry.

But still, we know a priori that chain is unnecessary, but to "eliminate" it every item in self has to be repeatedly tested with |_, sc, _| sc > 0 to discover the end of the first iterator. Why not just use a plain simple self.items.iter()?

jonhoo · 2022-06-28T01:04:40Z

Mostly to avoid duplicating the logic for doing the lookup into other. Also remember that the take_while optimizes to != 0 since it's unsigned anyway, and the .iter() on other and then consuming exactly one element is likely going to be very cheap. One thing we could do is have a private outer_join_self that just iterates over self with the lookup in other, and then have outer_join just do the chain. That way the methods that don't need to iterate over other can just call outer_join_self and skip the take_while. Happy to take a PR to that effect!

Boost Difference code coverage

e70be47

Enselic force-pushed the implement-difference branch from 5b11234 to e70be47 Compare June 5, 2022 08:13

Enselic mentioned this pull request Jun 6, 2022

Rewrite diff algo with hashbag to fix off-by-one skewing cargo-public-api/cargo-public-api#53

Merged

Make HashBag::difference() iterate over (&'a T, usize)

4b9fe77

Enselic force-pushed the implement-difference branch from bf114d3 to 4b9fe77 Compare June 12, 2022 19:06

jonhoo requested changes Jun 19, 2022

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

src/lib.rs Outdated Show resolved Hide resolved

src/lib.rs Show resolved Hide resolved

Dr-Emann reviewed Jun 19, 2022

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

Rename difference() to subtract() and make size_hint() dynamic

5734de6

Dr-Emann reviewed Jun 19, 2022

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

Lower bound on Subtract::size_hint() must be 0

e7797c4

jonhoo reviewed Jun 19, 2022

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

Enselic added 3 commits June 20, 2022 07:19

Rename subtract() back to difference() again

bdba808

Improve test of Difference::size_hint()

217ef6f

Improve HashBag::difference() doc wording

03221ad

Enselic force-pushed the implement-difference branch from 93d05d5 to 03221ad Compare June 20, 2022 05:40

Implement HashBag::difference() with an anonymous impl Iterator

70ff9c1

Instead of a struct. Co-authored-by: Federico Stra <[email protected]>

jonhoo approved these changes Jun 22, 2022

View reviewed changes

jonhoo merged commit 53ff43a into jonhoo:master Jun 22, 2022

Enselic deleted the implement-difference branch June 22, 2022 16:04

Enselic mentioned this pull request Jun 22, 2022

Upgrade hashbag to 0.1.6 and remove vendored code cargo-public-api/cargo-public-api#60

Merged

marcelogomez mentioned this pull request Jun 25, 2022

Implement outer join and friends #8

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `HashBag::difference(&self, other: &HashBag)` #6

Implement `HashBag::difference(&self, other: &HashBag)` #6

Enselic commented Jun 5, 2022

codecov bot commented Jun 5, 2022 •

edited

Loading

Enselic commented Jun 5, 2022

jonhoo commented Jun 5, 2022

Enselic commented Jun 5, 2022

Enselic commented Jun 5, 2022

Enselic commented Jun 5, 2022

Enselic commented Jun 6, 2022

jonhoo commented Jun 7, 2022

Enselic commented Jun 7, 2022 •

edited

Loading

jonhoo commented Jun 12, 2022

Enselic commented Jun 12, 2022 •

edited

Loading

Enselic commented Jun 19, 2022 •

edited

Loading

FedericoStra commented Jun 19, 2022 •

edited

Loading

jonhoo commented Jun 19, 2022

Enselic commented Jun 20, 2022

FedericoStra commented Jun 20, 2022 •

edited

Loading

Enselic commented Jun 20, 2022 •

edited

Loading

ratmice commented Jun 20, 2022 •

edited

Loading

FedericoStra commented Jun 20, 2022 •

edited

Loading

ratmice commented Jun 20, 2022

jonhoo commented Jun 22, 2022

Enselic commented Jun 22, 2022

jonhoo commented Jun 22, 2022

marcelogomez commented Jun 25, 2022

FedericoStra commented Jun 27, 2022

jonhoo commented Jun 27, 2022

FedericoStra commented Jun 27, 2022

jonhoo commented Jun 27, 2022

FedericoStra commented Jun 27, 2022

jonhoo commented Jun 28, 2022

Implement HashBag::difference(&self, other: &HashBag) #6

Implement HashBag::difference(&self, other: &HashBag) #6

Conversation

Enselic commented Jun 5, 2022

codecov bot commented Jun 5, 2022 • edited Loading

Codecov Report

Enselic commented Jun 5, 2022

jonhoo commented Jun 5, 2022

Enselic commented Jun 5, 2022

Enselic commented Jun 5, 2022

Enselic commented Jun 5, 2022

Enselic commented Jun 6, 2022

jonhoo commented Jun 7, 2022

Enselic commented Jun 7, 2022 • edited Loading

jonhoo commented Jun 12, 2022

Enselic commented Jun 12, 2022 • edited Loading

Enselic commented Jun 19, 2022 • edited Loading

FedericoStra commented Jun 19, 2022 • edited Loading

jonhoo commented Jun 19, 2022

Enselic commented Jun 20, 2022

FedericoStra commented Jun 20, 2022 • edited Loading

Enselic commented Jun 20, 2022 • edited Loading

ratmice commented Jun 20, 2022 • edited Loading

FedericoStra commented Jun 20, 2022 • edited Loading

ratmice commented Jun 20, 2022

jonhoo commented Jun 22, 2022

Enselic commented Jun 22, 2022

jonhoo commented Jun 22, 2022

marcelogomez commented Jun 25, 2022

FedericoStra commented Jun 27, 2022

jonhoo commented Jun 27, 2022

FedericoStra commented Jun 27, 2022

jonhoo commented Jun 27, 2022

FedericoStra commented Jun 27, 2022

jonhoo commented Jun 28, 2022

Implement `HashBag::difference(&self, other: &HashBag)` #6

Implement `HashBag::difference(&self, other: &HashBag)` #6

codecov bot commented Jun 5, 2022 •

edited

Loading

Enselic commented Jun 7, 2022 •

edited

Loading

Enselic commented Jun 12, 2022 •

edited

Loading

Enselic commented Jun 19, 2022 •

edited

Loading

FedericoStra commented Jun 19, 2022 •

edited

Loading

FedericoStra commented Jun 20, 2022 •

edited

Loading

Enselic commented Jun 20, 2022 •

edited

Loading

ratmice commented Jun 20, 2022 •

edited

Loading

FedericoStra commented Jun 20, 2022 •

edited

Loading