Specify a loose formatted form? #408

Open
matt-phylum opened this issue Mar 6, 2025 · 0 comments

@matt-phylum
Contributor

Compared to the URL spec, PURL is both more strict and less specified. PURL has a concept of a canonical format where the scheme and type are lowercase, the namespace and name are normalized according to type-specific rules, the qualifier keys are lowercased, empty qualifiers are removed, qualifiers are sorted, "." and ".." segments are removed from subpaths, and exact sets of characters are percent encoded.
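
As a rough illustration of the type-independent rules listed above, here is a minimal Python sketch. The function names are hypothetical and not taken from any PURL library. Percent encoding and type-specific normalization are deliberately left out, and dropping empty subpath segments is an assumption on my part rather than something stated above.

```python
def canonical_qualifiers(qualifiers: dict[str, str]) -> str:
    """Lowercase the keys, drop empty values, and sort the pairs by key.

    Values are emitted as-is; a real formatter would also percent encode them.
    """
    cleaned = {key.lower(): value for key, value in qualifiers.items() if value}
    return "&".join(f"{key}={cleaned[key]}" for key in sorted(cleaned))


def canonical_subpath(subpath: str) -> str:
    """Remove "." and ".." segments (removed, not resolved) from a subpath.

    Empty segments are dropped as well, which is an assumption of this sketch.
    """
    segments = [s for s in subpath.split("/") if s not in ("", ".", "..")]
    return "/".join(segments)
```

For example, `canonical_qualifiers({"B": "2", "a": "1", "c": ""})` gives `a=1&b=2`, and `canonical_subpath("./src/../main/")` gives `src/main`.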

This canonical form is nice because it means software can use PURLs as unique identifiers without understanding how to parse them.

However, I would be surprised if any two PURL implementations behaved the same in all cases. Most implementations make at least one mistake or intentional deviation from the spec, usually around percent encoding. This means that, in practice, if you want to use PURLs as unique identifiers, you do need to parse them at some point and convert them back to strings using a single implementation.
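
To make the percent encoding problem concrete, here is a small Python sketch of what "convert them back using a single implementation" could look like for one already-split component. `reencode` and `same_component` are hypothetical helpers, and `safe=""` is just a stand-in for whichever character set the chosen implementation uses; it is not the set required by the PURL spec.

```python
from urllib.parse import quote, unquote


def reencode(component: str) -> str:
    """Undo whatever percent encoding the producer applied, then encode
    again with one fixed encoder so equal values yield equal strings."""
    return quote(unquote(component), safe="")


def same_component(left: str, right: str) -> bool:
    """Compare two components after pushing both through the same encoder."""
    return reencode(left) == reencode(right)
```

With this, `1%2E0%2E0` and `1.0.0` compare equal even though the raw strings differ, as do `a%2fb` and `a%2Fb`. It has to be applied per component after the PURL has been split, which is exactly the parsing step that a canonical form was supposed to make unnecessary.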

This has been coming up recently with the work to make the spec use RFC 2119 language. The spec is being updated to say that formatters MUST produce the canonical output (this was always the case, just written with a lowercase "must"), but parsers are expected to accept formats that formatters are forbidden from producing. The spec is actually describing two different PURLs: the PURLs that parsers are allowed to read and the PURLs that formatters are allowed to write.

It seems like it would make sense to instead document PURLs and canonical PURLs separately.

Benefits:

  • This would make it easier to write a PURL library because it wouldn't matter if the most convenient URL encoding function available to you encodes more characters than necessary as long as you encode the minimum set of characters.
  • Implementations wouldn't need to implement (potentially incorrect) name normalization rules for every package type in the PURL spec and keep on top of adding new types. Implementations would support the rules they support (or provide a way for the user to customize the behavior) and pass everything else through without normalization. (see Concerns with type-specific component value transformations #38)
  • It would avoid confusion in the spec about whether a PURL is allowed to be a certain way during reading vs. during writing.

But it makes things more complicated for people who are using PURLs as unique identifiers and need to compare PURLs for equality.
