Refactor Rabin-Karp #110

Bruce-Feldman · 2018-07-24T19:02:53Z

Created Rabin hash function module.
Transitioned Rabin-Karp string matching function to Rabin hash function family.
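
For context, a rolling-hash search of the kind this PR refactors can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `rabinKarpSketch` and the constants `BASE` and `MOD` are assumptions chosen for the example, and it works on UTF-16 code units so that its results line up with `indexOf`.

```javascript
// Assumed parameters for illustration only (not taken from the PR).
const BASE = 257;     // polynomial base
const MOD = 1000003;  // prime modulus; keeps every product far below 2^53

// Hypothetical sketch of Rabin-Karp over UTF-16 code units.
function rabinKarpSketch(text, word) {
  const n = text.length;
  const m = word.length;
  if (m > n) return -1;

  // BASE^(m-1) mod MOD, used to remove the leading code unit when rolling.
  let high = 1;
  for (let i = 1; i < m; i += 1) high = (high * BASE) % MOD;

  // Hash the word and the first window of the text.
  let wordHash = 0;
  let windowHash = 0;
  for (let i = 0; i < m; i += 1) {
    wordHash = (wordHash * BASE + word.charCodeAt(i)) % MOD;
    windowHash = (windowHash * BASE + text.charCodeAt(i)) % MOD;
  }

  for (let start = 0; start + m <= n; start += 1) {
    // Equal hashes are only a hint; confirm with a real comparison.
    if (windowHash === wordHash && text.startsWith(word, start)) return start;
    if (start + m < n) {
      // Roll the window: drop text[start], append text[start + m].
      const removed = (text.charCodeAt(start) * high) % MOD;
      windowHash = ((windowHash - removed + MOD) * BASE + text.charCodeAt(start + m)) % MOD;
    }
  }
  return -1;
}
```

Only arithmetic operators and `%` are used, so intermediate values stay exact as long as they remain below 2^53.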

codecov-io · 2018-07-24T19:05:44Z

Codecov Report

Merging #110 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master   #110   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files         115    116    +1     
  Lines        2258   2260    +2     
  Branches      394    394           
=====================================
+ Hits         2258   2260    +2

Impacted Files	Coverage Δ
src/utils/hash/rolling/Rabin_Fingerprint.js	`100% <100%> (ø)`
src/algorithms/string/rabin-karp/rabinKarp.js	`100% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d090f76...fab2d46.

Bruce-Feldman · 2018-07-25T15:52:01Z

I additionally think that Rabin-Karp should be changed to be a beginner topic.

dubzzz · 2018-07-25T17:38:25Z

Actually it might fix #102 as the hash function has been changed. I can re-run my test on this change to confirm it fixes the problem.

dubzzz · 2018-07-26T17:39:42Z

It seems that some values are still failing with this implementation. Characters outside Unicode's Basic Multilingual Plane (BMP) make the algorithm fail:

rabinKarp("a\u{ffff}", "\u{ffff}"); // => OK - return: 1
rabinKarp("a\u{10000}", "\u{10000}"); // => FAIL - return: -1

// \u{10000} is LINEAR B SYLLABLE B008 A
// more at https://www.fileformat.info/info/unicode/char/10000/index.htm

See: https://runkit.com/dubzzz/rabin-2

Bruce-Feldman · 2018-07-26T17:53:37Z

This is a really cool bug. For some reason, on my machine, I get the following interesting result:

> var text = "a\u{10000}";
undefined
> text.length
3
> Array.from(text).length
2

I am not very familiar with the guarantees of these library functions. Is this expected behaviour? With your first example, I get the following behaviour (which is what I would expect):

> var text = "a\u{1000}"
undefined
> text.length
2
> Array.from(text).length
2

dubzzz · 2018-07-26T17:59:41Z

This is expected behavior. JavaScript strings are encoded in UTF-16, which uses 16-bit code units. Characters outside the BMP, that is, those whose code point is greater than 0xffff, require two code units (a surrogate pair) to encode one code point. Legacy JavaScript methods count code units, while modern ones (most of the time) work at the code point level (spread operator, Array.from, ...).

dubzzz · 2018-07-26T18:06:19Z

If your code uses both s.length and Array.from(s), you definitely have to choose one.

The code point equivalent of s.length is [...s].length or Array.from(s).length.

The code unit equivalent of Array.from(s) is s.split('').

dubzzz · 2018-07-26T18:11:10Z

Just an additional note:

'\u{10000}'.charCodeAt(0) // 0xd800 
'\u{10000}'.codePointAt(0) // 0x10000
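
The difference between code units and code points discussed here can be checked directly. A minimal sketch using only standard String and Array methods; the helper name `codePoints` is hypothetical:

```javascript
// Hypothetical helper: list the code points of a string.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0));
}

const text = 'a\u{10000}';

console.log(text.length);             // 3 (UTF-16 code units)
console.log(codePoints(text).length); // 2 (code points)
console.log(text.split('').length);   // 3 (code units; the surrogate pair is split)
```

Whichever unit an algorithm picks, lengths, indices, and character access all need to use the same one.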

Bruce-Feldman · 2018-07-26T19:10:16Z

@dubzzz I updated the string processing to handle this edge case. Could you take a look at it and see if it seems sufficient in the general case? I think I understood what you were saying about JavaScript's representation of strings, but I am not sure.

dubzzz · 2018-07-26T19:45:48Z

@Bruce-Feldman Just retried based on the last code you pushed. It seems that there is still an issue:

// your implementation
console.log(rabinKarp("a耀a","耀a")); // 1
console.log(rabinKarp("\u0000耀\u0000","耀\u0000")); // -1 ERROR

// indexOf
console.log("a耀a".indexOf("耀a")); // 1
console.log("\u0000耀\u0000".indexOf("耀\u0000")); // 1

You can easily re-run my test case either by using RunKit or locally by adding fast-check package.
Snippet: https://runkit.com/dubzzz/rabin-3 (tell me if you can't access it)

Bruce-Feldman · 2018-07-26T19:54:08Z

Any recommendations to fix this problem in the general case?

Bruce-Feldman · 2018-07-26T20:15:40Z

@dubzzz Looks like the issue was that I forgot that JavaScript performs bitwise operations on signed 32-bit integers. This should now be addressed by using arithmetic operators instead of bitwise ones.
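
The signed 32-bit behaviour mentioned above is easy to reproduce. A small sketch follows; the `hashStep` function and its constants are assumptions for illustration, not the code pushed to this PR:

```javascript
// Bitwise operators first convert their operands to signed 32-bit integers.
const big = 2 ** 31; // 2147483648, exactly representable as a double

console.log(big | 0);       // -2147483648 (wrapped into the signed 32-bit range)
console.log(1 << 31);       // -2147483648
console.log((2 ** 30) * 2); // 2147483648 (plain arithmetic does not wrap)

// Hypothetical hash step that avoids bitwise operators entirely.
const HASH_BASE = 256;       // assumed base
const HASH_MOD = 1000000007; // assumed prime modulus

function hashStep(hash, code) {
  // hash < HASH_MOD, so the product stays well below 2^53 and remains exact.
  return (hash * HASH_BASE + code) % HASH_MOD;
}
```

With `hashStep`, the running hash is always a non-negative integer below `HASH_MOD`, so two equal windows always produce the same value.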

dubzzz · 2018-07-26T20:32:17Z

With your last implementation, all my property based tests are green ;)

On my side, I think we should integrate such property-based testing tools into the repository to have more confidence in the algorithms and to cover all possible and legitimate corner cases. In the case of rabinKarp it helped to discover multiple issues, in both the current and the new implementation.

Concerning general ways to cover those edge cases on character encodings I do not have real recommendations :/
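
To illustrate the idea of a property-based test, here is a hand-rolled sketch that does not depend on fast-check or any other package. Every name in it (`seededRandom`, `randomString`, `findCounterexample`, `ALPHABET`) is hypothetical. The property checked is that a search function agrees with `indexOf` on a text built as `a + b + c` when searching for `b`.

```javascript
// Small seeded PRNG (mulberry32-style) so that a failing run can be replayed.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed alphabet mixing ASCII, NUL, BMP and non-BMP characters.
const ALPHABET = ['a', 'b', '\u0000', '耀', '\u{ffff}', '\u{10000}'];

function randomString(rand, maxLength) {
  const length = Math.floor(rand() * (maxLength + 1));
  let s = '';
  for (let i = 0; i < length; i += 1) {
    s += ALPHABET[Math.floor(rand() * ALPHABET.length)];
  }
  return s;
}

// Returns the first input on which `search` disagrees with indexOf, or null.
function findCounterexample(search, runs, seed) {
  const rand = seededRandom(seed);
  for (let i = 0; i < runs; i += 1) {
    const a = randomString(rand, 4);
    const b = randomString(rand, 4);
    const c = randomString(rand, 4);
    const text = a + b + c;
    if (search(text, b) !== text.indexOf(b)) {
      return { text, word: b };
    }
  }
  return null;
}
```

A call such as `findCounterexample(rabinKarp, 1000, 42)` would then either return `null` or a concrete failing input that can be replayed with the same seed. A dedicated library adds better generators and automatic shrinking on top of this.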

@dubzzz

Incorporate tests from @dubzzz

Bruce-Feldman · 2018-07-26T20:44:49Z

@dubzzz Woohoo! I am really pleased with your tests - they seem to generate edge cases fairly reliably. Thanks for the help!

trekhleb · 2018-07-30T09:20:18Z

@Bruce-Feldman, @dubzzz thank you for the PR and for the new Rabin-Karp test edge cases!

@Bruce-Feldman I like the idea of moving the hashing functionality out of RabinKarp. I even think that it should be moved not into utils but into a separate section that will contain different hash implementations and maybe different crypto-related algorithms. I'll merge your PR into a separate branch for now and create this new section for hashing, with related READMEs and links for further reading.

@dubzzz regarding the tool you've created that does property-based testing: it is really nice, I guess, since it helped to find such edge cases for Rabin-Karp as ^ !/\'#\'pp vs !/\'#\'pp'. But currently I'm not ready to answer your question about whether we'll have it in the current repo or not. Let me investigate it further, since I haven't had a chance to play with it so far.

dubzzz · 2018-07-30T11:05:19Z

@trekhleb Thanks for the update. Please let me know if you need more details on the approach. Meanwhile, I will try to find some more time to check other algorithms.

trekhleb · 2018-07-30T11:09:36Z

Thank you @dubzzz

dubzzz

I believe that the use of Math.random should be replaced by an appropriate property-based test. It would be:

reproducible
able to simplify the failure (in case there is one)
designed exactly for this purpose
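
As an illustration of what "simplifying the failure" means, the following is a hand-rolled greedy shrinker. It is only a sketch: the function `shrink` and the `fails` predicate are hypothetical, and real property-based testing tools use more elaborate strategies.

```javascript
// Greedy shrinking: repeatedly drop one code point from the text or the word
// as long as the predicate `fails` still reports a failure.
function shrink(text, word, fails) {
  let current = [Array.from(text), Array.from(word)];
  let progress = true;
  while (progress) {
    progress = false;
    for (let which = 0; which < 2 && !progress; which += 1) {
      for (let i = 0; i < current[which].length && !progress; i += 1) {
        // Copy both arrays, then remove one code point from one of them.
        const candidate = current.map((chars) => chars.slice());
        candidate[which].splice(i, 1);
        if (fails(candidate[0].join(''), candidate[1].join(''))) {
          current = candidate;
          progress = true;
        }
      }
    }
  }
  return { text: current[0].join(''), word: current[1].join('') };
}

// Stand-in predicate for the example: pretend every text containing a
// non-BMP character triggers the bug.
const fails = (text) => /[\u{10000}-\u{10ffff}]/u.test(text);

const minimal = shrink('ab\u{10000}c', 'bc', fails);
console.log(Array.from(minimal.text).length, minimal.word.length); // 1 0
```

The loop terminates because every accepted candidate is one code point shorter than the previous one.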


Bruce-Feldman force-pushed the master branch from 11367f2 to 593ac64 Compare July 25, 2018 14:25

Bruce-Feldman force-pushed the master branch from 593ac64 to 2863cb1 Compare July 26, 2018 19:08

Bruce-Feldman force-pushed the master branch from 2863cb1 to 0e5c2e0 Compare July 26, 2018 20:14

Bruce-Feldman force-pushed the master branch from 0e5c2e0 to b5b0788 Compare July 26, 2018 20:32

Bruce-Feldman closed this Jul 26, 2018

Bruce-Feldman reopened this Jul 26, 2018

Bruce-Feldman added 3 commits July 26, 2018 16:34

Simplify Rabin-Karp functionality

fbce153

Created Rabin Fingerprinting module within util directory

9683e60

Updated Rabin-Karp search to use rolling hash module

fab2d46

Incorporate tests from @dubzzz

Bruce-Feldman force-pushed the master branch from b5b0788 to fab2d46 Compare July 26, 2018 20:34

trekhleb changed the base branch from master to issue-102-rabin-karp-fix July 30, 2018 09:11

trekhleb merged commit c4605ea into trekhleb:issue-102-rabin-karp-fix Jul 30, 2018

dubzzz mentioned this pull request Jul 30, 2018

rabinKarp seems to miss some matches #102

Closed

dubzzz reviewed Nov 12, 2018

dubzzz mentioned this pull request Feb 14, 2019

Fixes bugs related to unicode support in polynomial-hash functions #304

Open

jixianu added a commit to jixianu/javascript-algorithms that referenced this pull request May 7, 2019

Merge pull request #2 from jixianu/issue-102-rabin-karp-fix

393980e

Refactor Rabin-Karp (trekhleb#110)
