-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
escape thing inside of code tag #2337
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is different than what I thought we were discussing. I thought you were going to submit a PR to change the options we pass to the sanitizer so it escapes vs. deletes unknown tags.
I don't think running regexes on the HTML is a good idea.
But adding the escape will convert tag like |
I want to understand the underlying problem here vs. fix one specific case. Consider:
The example in 1. is not valid HTML5, only 2. is. Why are we being dealing with invalid HTML in the first place? Are we being sent it? Or is our feed pipeline producing it somehow? In the case of your post, I see that we get this: I think the underlying issue is the fact that the sanitizer is forcing entities to be decoded, see fb55/htmlparser2#105. Undoing this has some interesting security implications. |
@Kevan-Y any updates? |
I think this is a hard problem. My gut says that we should turn off automatic entity decoding in the sanitizer, which is on by default to catch some edge case bugs. Because we pull content from existing systems vs. ingesting user-defined HTML directly, it's probably safe to accept what comes in and just strip the bits we don't want (e.g., scripts, etc). |
Let's try an experiment to change our HTML sanitizer to stop encoding automatically:
|
Update: telescope/src/backend/utils/html/replace-entities.js Lines 5 to 6 in bce0b21
Looked at the feed information that we get when something is inside of the code tag, it is in escaped character already, but because of the decode HTML. It turns into an actual character. Which produces turning to actual HTML. By removing replace-entities.js, the current issue will be fixed but this issue #1091 will be reproducible. Traced back where it is causing
I think when we pass our code.innerHTML to the hljs.highlightAuto() It's changing for example > to > .So maybe having a regex at the end
to change it back should be enough to fix that bug. |
// Replace the contents with newly marked up syntax highlighting | ||
code.innerHTML = value; | ||
code.innerHTML = value.replace(/\&([a-z]*);/gm, '&$1'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTML entities can be a lot more than just a-z
. Maybe this should be:
/\&([^;]+);/gm
In this case, it matches 1 or more non-semi-colon characters before a semi-colon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if &
is present on its own in the post? Won't this destroy it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the regex, added one more condition if the user input &≶
it will stay the same. The change only apply for thing e.g, <
or >
...
@@ -54,7 +54,8 @@ module.exports = function (dom) { | |||
pre.classList.add('hljs', language); | |||
} | |||
|
|||
// Convert & followed by another escape character to a single escape character |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment needs more details. It's odd that we're doing this, so we should explain why we have chosen to do it, and what it is fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
@@ -53,8 +53,9 @@ module.exports = function (dom) { | |||
if (pre) { | |||
pre.classList.add('hljs', language); | |||
} | |||
|
|||
// Value from hljs.highlightAuto turn escape character to e.g < to &lt; | |||
// Adding regex to convert & of that escape character back to & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, here is what they do: https://github.com/highlightjs/highlight.js/blob/7c87adcc39fed1754e9c85c2e7f744b560acd24e/src/lib/regex.js#L5-L7
// Replace the contents with newly marked up syntax highlighting | ||
code.innerHTML = value; | ||
code.innerHTML = value.replace(/\&([^;|^&]+);/gm, '&$1'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've got two cases here: not ;
and not &
. Can we get test coverage of both?
Also, there are about 8 issues filed on this, which reference various blog posts in the wild that have broken. Can you create some more tests based on those cases? It would be good to know that your code fixes them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some test cases. Some issues filed are duplicated from others so combined into 1 test case.
You've got two cases here: not ; and not &. Can we get test coverage of both?
Not sure if it is possible I can test for like &&
or ;&lt;
to stay the same when calling the whole process but not test &lt;
to turn <
by calling the whole process since this is happening inside of syntax-highlight.js with value. Like for example if we pass &lt;
to hljs.highlightAuto(code.innerHTML);
it will turn to &amp;lt;
then with the regex it will turned back to &lt;
We can cover that if we create a test case to test this specific regex. But I don't think we need it
@Kevan-Y can you rebase and fix the conflict please? |
bde3c64
to
6e6be2f
Compare
6e6be2f
to
6ba81f6
Compare
Rebased the branch to master, squash commit together to make it looks nicer, and fixed conflict |
Did you rebase onto the right commit? Looks like you've got issues with removeDevToFullscreen, which I think we ripped out quite a while ago.
|
Sorry about that. I fixed and ran the test. All should be good. |
6ba81f6
to
70ee609
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tempted to try this and see what happens, but I'd like to think on it a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to approve this and we'll test/fix based on what we see on staging.
Okay, so, after reading the original issue, the point of this PR is to fix up why things in a < code > tag were not previously loading in. And part of the problem before was our sanitization. The fix is to change some of the 'problematic' values which occur inparticular & > <? |
@Kevan-Y can give more details |
@manekenpix The issue was our current sanitization decode escape character inside of a The issue came from a code fix introduced here: telescope/src/backend/utils/html/replace-entities.js Lines 5 to 6 in bce0b21
This was to fix this issue #1091 So if we remove
to change it back should be enough to fix that bug. |
Okay awesome. |
If you can just rebase it onto master. So we can merge it should be good to go :). |
OK, let's try this. @Kevan-Y we'll open further issues to tweak it when we see what happens on staging. |
Issue This PR Addresses
Fixes #2322
Type of Change
Description
Removed
replace-entities.js
to fix our current issue but will re-introduce #1091.Added a regex at end of
telescope/src/backend/utils/html/syntax-highlight.js
Line 58 in bce0b21
&gt;
back to>
Checklist