Specify a loose formatted form? #408

matt-phylum · 2025-03-06T17:05:46Z

Compared to the URL spec, PURL is both stricter and less specified. PURL has a concept of a canonical format in which the scheme and type are lowercase, the namespace and name are normalized according to type-specific rules, the qualifier keys are lowercased, empty qualifiers are removed, qualifiers are sorted, `.` and `..` segments are removed from subpaths, and an exact set of characters is percent-encoded.
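To make a couple of these rules concrete, here is a minimal Python sketch of the qualifier and subpath steps only. The function names `canonical_qualifiers` and `canonical_subpath` are hypothetical, the input is assumed to be already percent-decoded, and type-specific normalization and encoding are left out entirely.

```python
def canonical_qualifiers(qualifiers: dict[str, str]) -> list[tuple[str, str]]:
    """Lowercase the keys, drop qualifiers with empty values, sort by key."""
    cleaned = {key.lower(): value for key, value in qualifiers.items() if value}
    return sorted(cleaned.items())


def canonical_subpath(subpath: str) -> str:
    """Drop empty, '.' and '..' segments (they are removed, not resolved)."""
    segments = [s for s in subpath.split("/") if s not in ("", ".", "..")]
    return "/".join(segments)


print(canonical_qualifiers({"Arch": "x86_64", "os": "", "Distro": "fedora"}))
# [('arch', 'x86_64'), ('distro', 'fedora')]
print(canonical_subpath("./src//../main/"))
# src/main
```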

This canonical form is nice because it means software can use PURLs as unique identifiers without understanding how to parse them.

However, I would be surprised if any two PURL implementations behaved the same in all cases. Most implementations make at least one mistake or intentional deviation from the spec, usually around percent encoding. In practice, this means that if you want to use PURLs as unique identifiers, you do need to parse them at some point and convert them back to strings using a single implementation.
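As an illustration of the percent-encoding problem, the sketch below encodes the same made-up component value with two different "safe" sets using Python's `urllib.parse.quote`. Neither set is claimed to be the one the PURL spec requires; the point is only that two encoders can disagree on the string while agreeing on the decoded value.

```python
from urllib.parse import quote, unquote

name = "my package:1"  # hypothetical component value, not from a real PURL

minimal = quote(name, safe=":")    # leaves ':' alone
aggressive = quote(name, safe="")  # encodes ':' as well

print(minimal)     # my%20package:1
print(aggressive)  # my%20package%3A1

# Comparing the strings says "different"; comparing decoded values says "same".
print(minimal == aggressive)                    # False
print(unquote(minimal) == unquote(aggressive))  # True
```

A consumer that compares the raw strings treats these as two identifiers, which is why the strings have to be round-tripped through one implementation first.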

This has been coming up recently with the work to make the spec use RFC 2119 language. The spec is being updated to say that formatters MUST produce the canonical output, which was always the requirement, just written as a lowercase "must". Parsers, however, are expected to accept formats that formatters are forbidden from producing. The spec is actually describing two different PURLs: the PURLs that parsers are allowed to read and the PURLs that formatters are allowed to write.
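A rough sketch of that read/write asymmetry, limited to the case of the scheme and type: the hypothetical `read_loose` accepts any casing, while the hypothetical `write_canonical` only ever emits lowercase. Everything after the type is passed through untouched, so this is not a real parser, and the example PURL is only illustrative.

```python
def read_loose(purl: str) -> tuple[str, str]:
    """Accept any casing of the scheme and type; return (type, remainder)."""
    scheme, sep, rest = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"not a PURL: {purl!r}")
    type_, sep, remainder = rest.partition("/")
    if not sep or not type_:
        raise ValueError(f"missing type: {purl!r}")
    return type_.lower(), remainder


def write_canonical(type_: str, remainder: str) -> str:
    """Always emit the lowercase scheme and type."""
    return f"pkg:{type_.lower()}/{remainder}"


# The reader accepts input that the writer would never produce.
print(write_canonical(*read_loose("PKG:NPM/left-pad@1.3.0")))
# pkg:npm/left-pad@1.3.0
```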

It seems like it would make sense to instead document PURLs and canonical PURLs separately.

Benefits:

- This would make it easier to write a PURL library, because it wouldn't matter if the most convenient URL encoding function available to you encodes more characters than necessary, as long as it encodes at least the minimum set of characters.
- Implementations wouldn't need to implement (potentially incorrect) name normalization rules for every package type in the PURL spec and keep on top of new types as they are added. Implementations would support the rules they support (or provide a way for the user to customize the behavior) and pass everything else through without normalization. (see Concerns with type-specific component value transformations #38)
- It would avoid confusion in the spec about whether a PURL is allowed to take a certain form when it is read versus when it is written.

But it makes things more complicated for people who are using PURLs as unique identifiers and need to compare PURLs for equality.
