Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The subpath normalization algorithm should follow RFC 3986 #405

Open
dwalluck opened this issue Mar 4, 2025 · 5 comments
Open

The subpath normalization algorithm should follow RFC 3986 #405

dwalluck opened this issue Mar 4, 2025 · 5 comments
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL subpath component

Comments

@dwalluck
Copy link

dwalluck commented Mar 4, 2025

Simply dropping '..' from the output leads to the case where the normal (human) interpretation of a path containing two dots differs from what the spec says.

For example, the normal interpretation of the fragment of a URI like pkg:golang/google.golang.org/genproto#googleapis/../api/annotations is googleapis/annotations, not googleapis/api/annotations as in the spec.

If someone provides that URI as input, they are most likely intending the later version to be the output.

I am suggesting just following https://datatracker.ietf.org/doc/html/rfc3986#section-5.2.4 instead so that the usual meaning is preserved.

@johnmhoran johnmhoran added PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL subpath component Ecma specification Work on the core specification labels Mar 4, 2025
@jkowalleck jkowalleck assigned jkowalleck and unassigned jkowalleck Mar 5, 2025
@jkowalleck
Copy link
Member

jkowalleck commented Mar 5, 2025

The rule is simple: for all path segments: path segment MUST NOT be .. nor .. That is the whole point.
There are many ways how to reach this, and dropping these segments is the easiest one.

I personally don't think this is confusion at all.
Any ../. are dropped, because they are simply forbidden in the first place. They must not exist, and there is no reason they were in the input in the first place.
Basically, every PURL with a ../. is an invalid purl. (the one who produced them is in charge if "fixing" them - if intended. they were tasked with not having ../. in the purl in the first place.)

It is a question of the robustness principle to still accept these invalid PURLS and make the best out of it.
And it is completely valid to simply drop the unacceptable parts, instead of trying to make sense out of them - which would cause resource use for a "nice" process, that is working this invalid input despite knowing that it could simply deny the work.

@dwalluck
Copy link
Author

dwalluck commented Mar 5, 2025

To me , "dropped" is not the same as "forbidden". "Dropped" means it's allowed, but ignored whereas "forbidden" means that an implementation should throw an error.

Currently, it's not invalid as per the test suite because the test suite has is_invalid: false in this example.

In packageurl-java, at least, is_invalid: true means that it throws an error, and is_invalid: false, as in the case of dots in the path, means that it does not throw an error but handles it in some other way.

I will lay out again my arguments again as to why the current behavior is suboptimal:

  1. The current way of handling maybe adheres better to the robustness principle, but it goes against the normal semantics of what dots in a path mean, both in the RFC and for filesystem paths in general.
  2. There is something to be gained by following the RFC, since a purl is supposed to be a URI after all, but handles the path differently than other URIs. Furthermore, it would allow the use of standard libraries, such as java.net.URI::normalize to normalize the path.
  3. The spec is unnecessarily complex, and I can see this in practice confusing the authors of various implementations, when it uses language like MUST NOT in one section of the spec, but then goes on to actually allow the behavior in the input in another section of a spec.

@ppkarwasz
Copy link

ppkarwasz commented Mar 6, 2025

  1. The spec is unnecessarily complex, and I can see this in practice confusing the authors of various implementations, when it uses language like MUST NOT in one section of the spec, but then goes on to actually allow the behavior in the input in another section of a spec.

Probably there should a section for a lenient and a strict parser. A strict parser would not discard .. segments, but throw an exception.

@jkowalleck
Copy link
Member

The spec is unnecessarily complex, and I can see this in practice confusing the authors of various implementations, when it uses language like MUST NOT in one section of the spec, but then goes on to actually allow the behavior in the input in another section of a spec.

@dwalluck, please help improve the current spec, by pull requesting improvements of wording and phrasing.
Changing the spec does not help, and may introduce breakage.

@dwalluck
Copy link
Author

dwalluck commented Mar 6, 2025

@dwalluck, please help improve the current spec, by pull requesting improvements of wording and phrasing.

It's hard to suggest an exact change when I am not even certain what is intended.

I can give the same example that we already used.

The statement about dots in the path here, particularly using strong language like MUST NOT

- When percent-decoded, a segment:
- MUST NOT contain a '/'
- MUST NOT be any of '..' or '.'
- MUST NOT be empty

when compared to what it says here

- Discard empty, '.' and '..' segments

If "it" (it is what form? I am not sure it is specified) must not contain them, then we must have thrown an error, hence there's nothing to drop. Otherwise it MAY contain them, but we MUST drop them from the canonical form.

I think I have stated elsewhere that the spec should be clear what is allowed as input vs. what is allowed in canonical form (e.g., what a "relaxed" parser is expected to fix when converted to canonical form and not error out).

I think I know that you intend the dots to be allowed in the input, but not the canonical form, but I do not find that clearly stated. There are conflicts, mostly involving qualifiers and subpath, where multiple things are stated.

Changing the spec does not help, and may introduce breakage.

Since the current spec is not finalized, I did not understand that breaking changes are impossible at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL subpath component
Projects
None yet
Development

No branches or pull requests

4 participants