-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Rabin-Karp #110
Refactor Rabin-Karp #110
Conversation
Codecov Report
@@ Coverage Diff @@
## master #110 +/- ##
=====================================
Coverage 100% 100%
=====================================
Files 115 116 +1
Lines 2258 2260 +2
Branches 394 394
=====================================
+ Hits 2258 2260 +2
Continue to review full report at Codecov.
|
I additionally think that Rabin-Karp should be changed to be a beginner topic. |
Actually it might fix #102 as the hash function has been changed. I can re-run my test on this change to confirm it fixes the problem. |
It seems that some values are still failing with this implementation. The characters outside the BMP-plan of Unicode make the algorithm fails: rabinKarp("a\u{ffff}", "\u{ffff}"); // => OK - return: 1
rabinKarp("a\u{10000}", "\u{10000}"); // => FAIL - return: -1
// \u{10000} is LINEAR B SYLLABLE B008 A
// more at https://www.fileformat.info/info/unicode/char/10000/index.htm |
This is a really cool bug. For some reason, on my machine, I get the following interesting result:
I am not so familiar with the guarantees of these library functions, is this expected behaviour? With your first example, I get the following behaviour (which I would expect):
|
This is an expected behavior. In JavaScript, strings are encoded using UTF-16 which encodes code units on 16 bits. Actually characters outside the BMP range, which means whose code point is greater than 0xffff, require two code units to encode one code point. Legacy methods of JavaScript are counting code units while modern ones (most of the time) work at code point level (spread operator, Array.from...). |
If your code uses both
|
Just an additional note: '\u{10000}'.charCodeAt(0) // 0xd800
'\u{10000}'.codePointAt(0) // 0x10000 |
@dubzzz I updated the string processing to handle this edge case. Could you take a look at it and see if it seems sufficient in the general case? I think I understood your what you were saying about javascript representation of strings, but am not sure. |
@Bruce-Feldman Just retried based on the last code you pushed. It seems that there is still an issue: // your implementation
console.log(rabinKarp("a耀a","耀a")); // 1
console.log(rabinKarp("\u0000耀\u0000","耀\u0000")); // -1 ERROR
// indexOf
console.log("a耀a".indexOf("耀a")); // 1
console.log("\u0000耀\u0000".indexOf("耀\u0000")); // 1 You can easily re-run my test case either by using RunKit or locally by adding fast-check package. |
Any recommendations to fix this problem in the general case? |
@dubzzz Looks like the issue was that I forgot that javascript performs bit operations on signed 32 bit integers. This should be addressed now by using non bit operators. |
With your last implementation, all my property based tests are green ;) On my side, I think we should integrate such property based testing tools into the repository to have more confidence in the algorithms and cover all possible and legits corner cases. In the case of Concerning general ways to cover those edge cases on character encodings I do not have real recommendations :/ |
@dubzzz Woohoo! I am really pleased with your tests - they seem to generate edge cases fairly reliably. Thanks for the help! |
@Bruce-Feldman, @dubzzz thank you for PR and for new Rabing Karp testing edge cases! @Bruce-Feldman I like the idea of moving hashing functionality out of RabinKarp. I even think that it should be moved not in @dubzzz regarding the tool you've created that does property based testing - it is really nice I guess since it helped to find such edge cases for Rabing Karp as |
@trekhleb Thanks for the update. Please keep me updated if you need more details on the approach. Meanwhile I will try to find some more times to checks other algorithms |
Thank you @dubzzz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the use of Math.random
should be replaced by an appropriate property based test. It will be:
- reproducible
- able to simplify the failure (in case there is one)
- and this is exactly designed for this purpose
* Simplify Rabin-Karp functionality * Created Rabin Fingerprinting module within util directory * Updated Rabin-Karp search to use rolling hash module Incorporate tests from @dubzzz
* Simplify Rabin-Karp functionality * Created Rabin Fingerprinting module within util directory * Updated Rabin-Karp search to use rolling hash module Incorporate tests from @dubzzz
Refactor Rabin-Karp (trekhleb#110)
Created Rabin hash function module.
Transitioned Rabin-Karp string matching function to Rabin hash function family.