ZEP9 (phase 1): add clarifications for extension naming #330

Open: wants to merge 47 commits into base: main

joshmoore
Member

@joshmoore joshmoore commented Feb 14, 2025

This PR clarifies the extension mechanism concept in the v3 specification. Comments on any changes which will break existing implementations are STRONGLY encouraged. Please see zarr-developers/zeps#65 for background material.

TODOs:

  • clarify the file numbering (currently 3.0.rst)
  • move definitions to the appropriate location (core, subtype page or ext

Post-merge:

@rabernat
Contributor

@joshmoore - really glad you got this started! 🙌

My feedback is that the PR is hard to review. It touches 15 files, including a ton of minor, unrelated formatting changes to the core spec document.

If we want folks to engage and give meaningful feedback, we need to make it easier to review. I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes.

@joshmoore
Member Author

@rabernat
really glad you got this started! 🙌

👍

It touches 15 files

You're right. I've extracted out #331.

including a ton of minor, unrelated formatting changes to the core spec document.

I disagree that they are unrelated. Take a look. The sections I've modified were basically already un-parseable. Since I was adding sections, the outline was getting more convoluted.

I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes.

👍 Give it a look and let me know what you think.

@jbms
Contributor

jbms commented Feb 17, 2025

Thanks for all of your work on this!

My current understanding of the practical effect of the proposal is as follows:

  • raw names will be granted fairly easily, e.g. zstd, bfloat16, and others I've proposed would be assigned to me; the ones that zarr-python has started using (string, bytes, vlen-utf8, etc.) would be assigned to someone from zarr-python. URL names will be used only for really experimental stuff; all commonly used extensions will have raw names since they will be minimal effort. Therefore, the verbosity of the URLs is not really a problem in practice.

  • the ZEP process, or really any mandatory review process at all, will not be used for proposing extensions that fit into any of the existing extension points, only for entirely new extension points. At most someone might ask around for comments informally before adopting something.

The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process.

@joshmoore joshmoore mentioned this pull request Feb 18, 2025
@normanrz
Member

The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process.

I share your concerns to some degree. I think we can adapt the governance structure for extensions in the future, if we think that a more thorough review process would be necessary. We are thinking of forming a zarr specs team that could take on that responsibility.

@rabernat
Contributor

rabernat commented Mar 1, 2025

I just read through this whole thread, as well as the Arrow extension docs. Arrow is very well thought out. I think we should be open to learning from their example. The feedback from @d-v-b and @jbms is also very appreciated.

Here's how I propose we move forward.

  • Drop URLs from the spec. I've been convinced they don't add much value for us and lead to all kinds of weird contradictions.
  • Adopt Arrow-style namespacing for extension names.
    • bare names, e.g. float32, are only for the core dtypes already defined in the spec.
    • canonical extensions use the zarr namespace, e.g. zarr.bfloat16, and will be tracked in the extensions repo we have proposed.
    • user-defined extensions can use any namespace they want, without any central registry. Here we recognize that there could be collisions and don't attempt to solve that problem. Implementations determine how they are going to interpret and resolve these extensions. I like this because it's what we are already doing with numcodecs (e.g. numcodecs.GZip), which demonstrates that it is an obvious, common-sense way to address the problem. Wherever it makes sense, we can try to link to popular extensions from the Zarr docs, as Arrow does for GeoArrow extensions.
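
As a rough illustration of the three tiers proposed above, the rule could be sketched in Python as follows. The function name `classify_extension_name` and the returned labels are hypothetical and not part of any spec or library; the only details taken from the proposal are that bare names carry no namespace, that canonical extensions use the `zarr.` prefix, and that any other namespace is user-defined.

```python
def classify_extension_name(name: str) -> str:
    """Classify a name under the proposed Arrow-style scheme (illustrative only)."""
    if not name:
        raise ValueError("extension name must be non-empty")
    if "." not in name:
        # Bare names: reserved in the proposal for what the core spec already defines.
        return "bare"
    namespace, _, local_name = name.partition(".")
    if not namespace or not local_name:
        raise ValueError(f"malformed namespaced name: {name!r}")
    if namespace == "zarr":
        # Canonical extensions, tracked in the proposed extensions repo.
        return "canonical"
    # Any other namespace: no central registry, collisions are possible.
    return "user-defined"


for example in ("float32", "zarr.bfloat16", "numcodecs.GZip"):
    print(example, "->", classify_extension_name(example))
```

The sketch deliberately does nothing about collisions between user-defined namespaces, since the proposal leaves those to implementations.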

For the purpose of developing and maturing extensions, an extension can start as a user-defined extension (e.g. numcodecs.GZip) and then evolve to a canonical extension.

We should aim to make the "canonical extensions" process as simple and easy as possible, to encourage centralization of extensions. This enhances interoperability. Meanwhile, the user-defined extensions mechanism gives a clear way for organizations to define private, custom extensions while staying within the spec.

If a user-defined extension wanted to make itself more discoverable / interoperable, I suppose we could add a specification field to the dtype metadata which permits a link to a URL.

@joshmoore
Member Author

joshmoore commented Mar 1, 2025

Commits pushed:

@d-v-b #330 (comment)
In the context of uncoordinated, private extensions, or work within an organization I don't think we should encourage people to use unregistered short names.

I very much agree with this.

Rather, we should discourage (or bar) people from using reserved names, and otherwise allow them to use whatever names are useful for them.

Unfortunately, we didn't put a prefix like zarr. in place to begin with, which makes this less than straightforward.

@d-v-b #330 (comment)
@normanrz and @joshmoore, if you do not want to require that implementations
actually check the requirement that extension names be either a name registered
on github or a URL, can you explain why you are opposed to changing the
language from "MUST" to "SHOULD"?

I don't remember saying that I was opposed to the language. To explain my thinking, though: reading "SHOULD" would imply to me that there's another permissible choice. What I hear you saying is that "MUST" would imply that there's an actual error.

@d-v-b #330 (comment)
I think the fundamental problem with this PR right now is the attempt to
require that all extensions be globally interoperable.

I don't see how that's the case. If you mean, it's trying to support a global(ly unique) namespace, then yes, that's true.

I'll leave folks to mull over @rabernat's comment that just came in, and come back to this after the weekend. Best.

@normanrz
Member

normanrz commented Mar 1, 2025

Thanks @rabernat. Fundamentally, the Arrow extension mechanism seems very similar to what we have been proposing here, which is reassuring. We can of course continue discussing the details, but we should also be sure to steer away from a design-by-committee situation.

For Zarr, I don't see a good reason for differentiating between "bare names" and prefixed "canonical extensions". We can control the naming of both uniformly through the zarr-extensions repo. Extensions can include prefixes, if they want, but I wouldn't force that. Our process for registering raw names is definitely more lightweight than registering a canonical extension in arrow.

Re "user-defined extensions". I want to remind everybody that in our proposal it is very easy to register a "raw name" in zarr-extensions. Also, it is an express goal of our proposal to avoid naming conflicts. I wouldn't want to step back from that.

Re dropping URLs. URLs have a couple of nice properties that we want for avoiding naming conflicts, self-documentation, and compatibility with json-ld. On the other hand, the downside of URLs seems to boil down to people finding URLs weird. So, I would be inclined to stick with URLs. Also, I want to reiterate that it will be very easy to register raw names, so, most extensions in the field will not use URLs.

Re maturing extensions. In an earlier version of our proposal, extensions would mature by changing their names (i.e. from a URL, via a prefixed name, to a raw name). Now, we think it is better to find a different indicator of maturity so that extensions don't have to change their names, which would create unnecessary complexity for implementations.

In summary, I think our two-level naming system (i.e. centralized through zarr-extensions and uncoordinated free-for-all) is less complex than adopting arrow's system and would work really well for Zarr and fit the current community practice.

@d-v-b
Contributor

d-v-b commented Mar 2, 2025

Just to emphasize, I want a boring, simple solution here. Our default solution should be to copy something that has worked in a similar project. If you are rejecting something that has worked for another project, then I would like to see an engineering-based explanation for that decision.

On its own terms, the arrow spec is pretty simple: there are two types of extensions. The first extension type is decentralized, and the spec makes NO requirements for what names they use, only recommendations:

We recommend that you use a “namespace”-style prefix for extension type names to minimize the possibility of conflicts with multiple Arrow readers and writers in the same application. For example, use myorg.name_of_type instead of simply name_of_type
...
Extension names beginning with arrow. are reserved for canonical extension types, they should not be used for third-party extension types.

IMO This language is very easy for implementations to understand. It doesn't aim to globally prevent name collisions, but I suspect that is OK in practice. We should learn from this.

The second extension type is centralized, and there are more requirements, but crucially, there are no explicit name requirements. Instead, all the requirements are scoped to the extension itself. I think it's safe to assume that any name collisions will be handled by the process of vetting the extension.

I think both of these extension types could work for us. I also think the arrow spec is simpler than this PR, because the arrow spec imposes fewer requirements. For an extension developer, you just have to choose a name that composes with the extensions your implementation already knows about, and you are done. That's far simpler than introducing a dependency on a separate github repo.

As an implementation developer, I would be happy working with something like the arrow spec. I cannot say the same for this PR in its current state.

@d-v-b
Contributor

d-v-b commented Mar 2, 2025

As an implementation developer, I would be happy working with something like the arrow spec. I cannot say the same for this PR in its current state.

to elaborate on this: I am specifically opposed to the requirement that extension names be registered on github OR a URL. I'm open to discuss alternatives to this requirement (e.g., making it a suggestion, or finding another way entirely to achieve the goals of this requirement).

@maxrjones
Member

maxrjones commented Mar 2, 2025

Thanks for all your work on enabling extensions! I have a few comments based on my experiences contributing to the GeoZarr WG over the past couple years and reading through all the linked documents.

Specific concern around URLs

I wanted to offer a couple recent experience-based observations that could help make concerns with URLs a bit more concrete:

  1. Domains are easily lost. The most common case is when people forget to pay the DNS registration renewal fee. Many projects also don't have backup plans for sharing ownership of domains. I recently had the sad job of rescuing a domain after a project's BDFL passed away, which was thankfully still possible but only due to his family's generosity with their time. While not super common, these things happen and would be way more stressful for project maintainers if that domain was also the immutable registration point for a Zarr extension.
  2. GitHub organizations change (see Move or fork to independent organization nsidc/earthaccess#929 for an example of an ongoing discussion). The Zarr community can design the format to be interpretable for decades or hopefully much longer, and who knows what code hosting will look like in future decades.

If it's truly necessary to have a persistent identifier for extensions, DOIs were created for this purpose. The one component in the original description of URLs that isn't possible with DOIs is arguably self-describability, but in practice URLs can be quite challenging to interpret without visiting the hosted content.

Recommendation to bring back the extension key

I really like how STAC stores all the extensions in a distinct metadata field and also preferred that structure in zarr-developers/zeps#65. In addition to the performance benefits, I think the readability is super key to keeping Zarr's easy interpretability as a strength. E.g., I find it quite hard to tell which keys in the example are actually extension points vs. the core spec without reviewing through the entire set of documents another time. It would be immediately obvious if extensions were stored under an 'extensions' key or something similar.

Recommendation to avoid conflicts within attributes

A lot of extensions will likely relate to metadata stored under attributes. Could it be in-scope to recommend that extension-specific metadata is stored under a key within attributes that matches the extension name, as is done by NGFF, since that falls under a similar scope to this PR in avoiding conflicts between extensions?
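
A minimal sketch of what this recommendation could look like in practice, assuming a hypothetical extension named `example.geo` and a hypothetical helper `set_namespaced_attrs` (neither comes from the PR or from any Zarr library):

```python
def set_namespaced_attrs(attributes: dict, extension_name: str, values: dict) -> dict:
    """Return a copy of `attributes` with `values` merged under `extension_name`."""
    updated = dict(attributes)
    # Keep anything already stored under this extension's key.
    scoped = dict(updated.get(extension_name, {}))
    scoped.update(values)
    updated[extension_name] = scoped
    return updated


# The user's own "crs" key and the extension's "crs" key no longer collide.
user_attrs = {"title": "sea surface temperature", "crs": "user-owned value"}
attrs = set_namespaced_attrs(user_attrs, "example.geo", {"crs": "EPSG:4326"})
print(attrs)
```

Scoping by name keeps user-controlled top-level keys untouched, which is the conflict this recommendation is trying to avoid.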

@jbms
Contributor

jbms commented Mar 3, 2025

Thanks for all your work on enabling extensions! I have a few comments based on my experiences contributing to the GeoZarr WG over the past couple years and reading through all the linked documents.

Specific concern around URLs

I wanted to offer a couple recent experience-based observations that could help make concerns with URLs a bit more concrete:

  1. Domains are easily lost. The most common case is when people forget to pay the DNS registration renewal fee. Many projects also don't have backup plans for sharing ownership of domains. I recently had the sad job of rescuing a domain after a project's BDFL passed away, which was thankfully still possible but only due to his family's generosity with their time. While not super common, these things happen and would be way more stressful for project maintainers if that domain was also the immutable registration point for a Zarr extension.
  • As far as I understand, given the ease of registration of an extension, use of URLs would probably be relatively rare and mostly for experimental things.
  • Loss of control of the domain would mean that the URLs are no longer resolvable to the specification, except perhaps via archive.org, assuming they resolved to a specification in the first place. However, there would almost surely still be a record of the extension in the source code of one or more zarr implementations that has support for the extension, and that implementation could probably be found via a google search for the extension identifier.
  • While loss of control of the domain could theoretically create a potential for naming conflicts, I don't expect that would be a problem in practice.
  1. GitHub organizations change (see Move or fork to independent organization nsidc/earthaccess#929 for an example of an ongoing discussion). The Zarr community can design the format to be interpretable for decades or hopefully much longer, and who knows what code hosting will look like in future decades.

If it's truly necessary to have a persistent identifier for extensions, DOIs were created for this purpose. The one component in the original description of URLs that isn't possible with DOIs is arguably self-describability, but in practice URLs can be quite challenging to interpret without visiting the hosted content.

A DOI would already be allowed as a URL, but being a numeric identifier would make the metadata very difficult for humans to interpret.

Do you have a specific alternative proposal for naming?

The name registration process already would seem to address all of these concerns.

Recommendation to bring back the extension key

I really like how STAC stores all the extensions in a distinct metadata field and also preferred that structure in zarr-developers/zeps#65. In addition to the performance benefits, I think the readability is super key to keeping Zarr's easy interpretability as a strength. E.g., I find it quite hard to tell which keys in the example are actually extension points vs. the core spec without reviewing through the entire set of documents another time. It would be immediately obvious if extensions were stored under an 'extensions' key or something similar.

I'm not sure I understand the argument regarding performance benefits. But I'm also not sure exactly what you have in mind regarding being able to distinguish which keys are extensions vs core --- are you saying that you want zstd, vlen-utf8, and string to be listed separately as well in some top-level "extensions" metadata field in addition to how they are listed in the example? What is not clear to me is why it is important to easily distinguish "extension" from "core". Within an implementation, there may be no difference whatsoever between a "core" feature and an "extension" feature. Similarly, for a user the only thing that matters is how widely supported a given feature is, and an extension might even be more widely supported than an optional core feature.

The https://github.com/zarr-developers/zarr-extensions repo already provides a unified listing of things regardless of whether they are in the core spec or an extension.

I think this proposal is in part an acknowledgement that the ZEP process has not worked well for defining extensions under the existing extension points and I expect that if this proposal is accepted, no new extensions under the existing extension points may be added to the core spec, and the ZEP process would only be used for new extension points.

Recommendation to avoid conflicts within attributes

A lot of extensions will likely relate to metadata stored under attributes. Could it be in-scope to recommend that extension-specific metadata is stored under a key within attributes that matches the extension name, as is done by NGFF, since that falls under a similar scope to this PR in avoiding conflicts between extensions?

Extensions are explicitly intended for things that alter the behavior of the zarr implementation itself. OME-Zarr just builds on top of Zarr and therefore would not be considered a zarr extension. @rabernat previously proposed calling things like OME-Zarr "zarr conventions".

I actually expect that it would be relatively unlikely for a zarr extension to store metadata in attributes because users are supposed to have full control over attributes, which could create a conflict with the extension.

However, it could be very reasonable for there to be a registry of attribute names that is very similar to the registry of zarr extensions. The only issue is that currently there are no restrictions on what attribute keys are allowed and therefore it is not clear how to distinguish "registered attributes" from plain attributes.

In any case I think it would be good to limit the scope of this discussion to just proper zarr extensions in the interest of getting that part sorted out more quickly and efficiently.

@normanrz
Member

normanrz commented Mar 3, 2025

Thanks @maxrjones. I largely agree with @jbms's reply, but would like to add two points:

Recommendation to bring back the extension key

In ZEP 9, we proposed the extensions key. We still want to do that, but in a separate PR and, likely, with a vote on the ZEP. This PR is meant to clarify things that don't require a full vote.

Recommendation to avoid conflicts within attributes

Currently, OME-Zarr puts all its metadata under attributes because there is no other place. I wouldn't entirely rule out that OME-Zarr would become a Zarr extension (i.e. metadata under extensions), in the future. But I agree that this would be a future discussion.

@maxrjones
Member

Do you have a specific alternative proposal for naming?

In the interest of speed and since it's easier to add than take away, you could just take out URLs as Davis and Ryan asked for, see how it goes with requiring non-url based registration in the zarr-extensions repo, and add it as a new PR if it seems like it's necessary. As I've now given both my concrete concern and a specific proposal, I'm not going to engage on the URL debate further. Thanks again for working on this!

Regarding my other comments, I am now more confused about what could become a proper Zarr extension and therefore fall under the purview of these naming requirements. I started a thread on Zulip to get clarification without diluting the discussion on this PR, if anyone would be willing to offer clarifications there 🙏

@rabernat
Contributor

rabernat commented Mar 3, 2025

Just wanted to note that there are two new contributions for dtypes and codecs in V3 over in Zarr Python

These offer a great opportunity for us to explore the implications of this ZEP. What sort of guidance would we provide to these contributors on naming their codecs?

@joshmoore
Member Author

It's a good question, @rabernat. From the current written text, their next step would be to open a PR against zarr-extensions (And I know having testers there would make @normanrz happy).

If they didn't want to do that, they could use a URL (e.g., https://github.com/dimitri-yatsenko/anscombe-numcodecs) with the caveat that nothing in this PR (nor anything in the comments to date) deals with later wanting to move to (or alias) a raw name.

What would you want the guidance to them to look like?

@normanrz
Member

normanrz commented Mar 4, 2025

In the interest of speed and since it's easier to add than take away, you could just take out URLs as Davis and Ryan asked for, see how it goes with requiring non-url based registration in the zarr-extensions repo, and add it as a new PR if it seems like it's necessary.

I like that!

Contributor

@rabernat rabernat left a comment

I appreciate everyone's constructive comments and feedback. Here is a set of changes that I believe addresses some of the remaining points around extension naming. Specifically, it replaces URL-based extension names with Arrow-style namespaced extensions.

Simply dropping URLs is not ideal. There will inevitably be organizations who want and need totally private extensions, which they never intend to share with the rest of the world. This should be explicitly allowed by the spec. This is what namespaced extensions are for. This also covers all of the development scenarios.

Comment on lines +1090 to +1091
a known "`raw name <extension-naming-raw-names>`_" or
a "`URL-based name <extension-naming-url-based-names>`_" as defined under :ref:`extensions_section`.
Contributor

Suggested change
a known "`raw name <extension-naming-raw-names>`_" or
a "`URL-based name <extension-naming-url-based-names>`_" as defined under :ref:`extensions_section`.
a known "`raw name <extension-naming-raw-names>`_" (for registered extensions) or
a "`namespaced extension <extension-naming-namespaced-names>`_" (for private / experimental extensions) as defined under :ref:`extensions_section`.

Extension naming
----------------

The `name` field of an extension can take two forms: **raw names** and **URL-based names**.
Contributor

Suggested change
The `name` field of an extension can take two forms: **raw names** and **URL-based names**.
There are two types of extension names:
- **raw names** - intended for well-known extensions aimed at broad adoption and maximum interoperability.
- **namespaced extensions** - intended for private extensions and development purposes.

Contributor

@d-v-b d-v-b Mar 6, 2025

I feel like it's more clear to say that there are two types of extensions -- those that are centrally registered (these may have raw names OR namespaced names), and those that are not centrally registered (these should have namespaced names).

Contributor

Why would we be registering namespaced codecs? In my head these were mutually exclusive categories. What you describe sounds more confusing and ambiguous.

Contributor

if codec foo.codec becomes popular enough that its owners want to register it centrally, doesn't it make sense to keep the same name?

Contributor

maybe it would help to sketch out the process for "publishing" an extension that was previously unpublished -- it sounds like you would want the prefix to be removed?

Contributor

In principle, someone else might be using the same namespace for their own private use. Registering a namespaced name creates the possibility of a conflict.


Raw names are centrally registered names which can be used without prefix.

Raw names MUST be assigned within a central repository.
Contributor

Suggested change
Raw names MUST be assigned within a central repository.
Raw names MUST be assigned within a central repository, in order to ensure their uniqueness.

Raw names MUST be assigned within a central repository.
Raw names are unique and immutable.
Raw names MUST start with one lower case letter a-z and then be followed
by only lower case letters a-z, numerals 0-9, underscores, dashes, and dots.
Contributor

Suggested change
by only lower case letters a-z, numerals 0-9, underscores, dashes, and dots.
by only lower case letters a-z, numerals 0-9, underscores, and dashes.

Dot characters are forbidden, to avoid confusion with namespaced extensions.

discretion.

- **Example:** ``zstd``
- **Accepted regex:** ``^[a-z][a-z0-9-_.]+$``
Contributor

Suggested change
- **Accepted regex:** ``^[a-z][a-z0-9-_.]+$``
- **Accepted regex:** ``^[a-z][a-z0-9-_]+$``
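
To make the effect of this suggestion concrete, here is a small Python sketch that checks candidate names against both character classes. The helper `is_raw_name` is hypothetical; the patterns mirror the two regexes above, written with the hyphen last in the character class (an equivalent spelling that avoids any ambiguity about ranges) and applied with `re.fullmatch` instead of explicit anchors.

```python
import re

# Character class from the text under review (dots allowed).
RAW_NAME_WITH_DOTS = re.compile(r"[a-z][a-z0-9_.-]+")
# Character class from the suggested change (dots forbidden).
RAW_NAME_NO_DOTS = re.compile(r"[a-z][a-z0-9_-]+")


def is_raw_name(name: str, allow_dots: bool = False) -> bool:
    """Return True if `name` fully matches the selected raw-name pattern."""
    pattern = RAW_NAME_WITH_DOTS if allow_dots else RAW_NAME_NO_DOTS
    return pattern.fullmatch(name) is not None


# Print each candidate with its result under the old and the suggested rule.
for candidate in ("zstd", "vlen-utf8", "numcodecs.gzip", "Zstd", "z"):
    print(candidate, is_raw_name(candidate, allow_dots=True), is_raw_name(candidate))
```

Because both patterns use `+`, a single-letter name such as `z` is rejected by either version, which the prose rule does not spell out.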

Comment on lines +1664 to +1668
* If you are just getting started, use the URL of your work-in-progress as an
identifier for your extension. The GitHub link, including the branch if you
would like, makes a fine choice. This says to the community that this is a
draft, and if they are interested in the details, they can follow the URL to
find out more.
Contributor

Suggested change
* If you are just getting started, use the URL of your work-in-progress as an
identifier for your extension. The GitHub link, including the branch if you
would like, makes a fine choice. This says to the community that this is a
draft, and if they are interested in the details, they can follow the URL to
find out more.
* If you are just getting started, use a namespaced extension for your extension name. As your extension matures, you may consider registering it using a raw name.

Comment on lines +1670 to +1674
* When developing an extension for which you intend to register a short name,
you may wish to test it using the short name even before you have registered
it. However, you MUST register the name before using the extension for
non-test purposes/for purposes where interoperability with other
implementations/users is a concern.
Contributor

Suggested change
* When developing an extension for which you intend to register a short name,
you may wish to test it using the short name even before you have registered
it. However, you MUST register the name before using the extension for
non-test purposes/for purposes where interoperability with other
implementations/users is a concern.
* If you intend to distribute data widely using your extension, you SHOULD register your extension using a raw name, rather than a namespaced name.

Comment on lines +1683 to +1687

* If you migrate your URL-based extension to a new location, try to redirect the
previous URL to the new location or document the migration. Similarly, if you
register a raw name extension after having used an URL-based extension in production,
cross-link the two pages.
Contributor

Suggested change
* If you migrate your URL-based extension to a new location, try to redirect the
previous URL to the new location or document the migration. Similarly, if you
register a raw name extension after having used an URL-based extension in production,
cross-link the two pages.

Comment on lines +1680 to +1682
* For raw names that are coming from well-known projects, use the same prefix followed
by a dot for requesting your raw name, e.g. "numcodecs.". Other examples of prefixes can
be found in the `zarr-extensions`_ repository.
Contributor

Note that these would be considered namespaced extensions under my proposal.

Contributor

related to my comment above but we should clarify early on that registered names can also be namespaced.


- Clarification of extensions. `PR #330
<https://github.com/zarr-developers/zarr-specs/pull/330/>`_. With this change,
it is now possible to register new names or even use URLs for extensions.
Contributor

Suggested change
it is now possible to register new names or even use URLs for extensions.
it is now possible to register new names or use namespace-prefixed names for extensions.

@joshmoore
Member Author

Before too many comments are made here, a heads up that I'd appreciate having this as a PR so it can be downloaded, built with Sphinx, checked for warnings, etc. I didn't find a way to do that for @d-v-b's comments in the first round.

@rabernat
Contributor

rabernat commented Mar 6, 2025

Ok I will make a PR.

@joshmoore
Member Author

joshmoore commented Mar 6, 2025

Thanks! I assume the comments above though give folks a good sense of what you're thinking and the comments can start. Looking forward to everyone's feedback.

Edit: I realized that I could accept all the comments and then revert, but it seems like that could be confusing.

@rabernat
Contributor

rabernat commented Mar 6, 2025

I made a PR against your branch here: joshmoore#1

@LDeakin

LDeakin commented Mar 6, 2025

Comments on any changes which will break existing implementations are STRONGLY encouraged

The current spec says

In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec

So aren't all the changes proposed so far a regression? Zarr 3.0 arrays with a URI codec name are currently conformant but will become non-conformant. I suggest just adding something like the following to respect the stability policy:

In Zarr 3.0, the name of an extension codec was required to be a URI that dereferences to a human-readable codec specification. That is now discouraged and ...

@d-v-b
Contributor

d-v-b commented Mar 6, 2025

Comments on any changes which will break existing implementations are STRONGLY encouraged

The current spec says

In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec

So aren't all the changes proposed so far a regression? Zarr 3.0 arrays with a URI codec name are currently conformant but will become non-conformant. I suggest just adding something like the following to respect the stability policy:

In Zarr 3.0, the name of an extension codec was required to be a URI that dereferences to a human-readable codec specification. That is now discouraged and ...

I always found this requirement hard to understand, given that none of the codecs defined alongside the v3 spec followed it, which suggested that it was not actually a real requirement. Are there codecs in the wild that followed this requirement?

@LDeakin

LDeakin commented Mar 7, 2025

I always found this requirement hard to understand, given that none of the codecs defined alongside the v3 spec followed it, which suggested that it was not actually a real requirement.

Oh that is a good point, I'd interpreted it as just for codecs not defined in the spec. And you've of course noted in the past the inconsistency in that section. Although I think the intention was clear: use URIs to avoid clashes and make it possible to resolve how the data was encoded. ZEP0009 achieves the same objective and is a big improvement, but it should be adapted to achieve that without potentially invalidating existing data and breaking the stability policy. ZEP0009 can achieve this with a few extra words and no effort from implementations.

Are there codecs in the wild that followed this requirement?

I did follow the spec requirement to use URIs for the custom codecs I've got in zarrs. I can alias those and continue to read them if they go into zarr-extensions, but existing data would become officially non-conformant given the change from "MUST use URIs" to "MUST NOT use URIs except URLs", and URLs look set to be removed too with joshmoore#1.
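
The aliasing approach mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the alias table, the URI `https://example.org/codecs/my_codec`, and the registered name `example.my_codec` are hypothetical placeholders, not real zarrs or zarr-extensions identifiers. The metadata layout (a `codecs` list of objects with `name` and `configuration`) follows the Zarr v3 array metadata.

```python
# Hypothetical alias table: legacy URI-style codec names -> registered names.
# Neither the URI nor the registered name below is a real identifier.
LEGACY_CODEC_ALIASES = {
    "https://example.org/codecs/my_codec": "example.my_codec",
}


def resolve_codec_name(name: str) -> str:
    """Return the registered name for a legacy URI name, else the name unchanged."""
    return LEGACY_CODEC_ALIASES.get(name, name)


def normalize_codecs(metadata: dict) -> dict:
    """Return a copy of the array metadata with legacy codec names resolved."""
    codecs = [
        {**codec, "name": resolve_codec_name(codec["name"])}
        for codec in metadata.get("codecs", [])
    ]
    return {**metadata, "codecs": codecs}


# Example array metadata that still uses a URI as the codec name.
metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "codecs": [
        {"name": "https://example.org/codecs/my_codec", "configuration": {"level": 1}},
        {"name": "bytes", "configuration": {"endian": "little"}},
    ],
}

normalized = normalize_codecs(metadata)
```

A reader built this way keeps opening existing data, but the stored metadata still carries the URI name, which is why the conformance wording in the spec matters independently of what implementations do.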

@joshmoore
Member Author

Interesting points, @LDeakin. Thanks. ZEP9's goal was to be as conformant as possible to the various interpretations in v3.0 ("use URIs" while clearly defining raw names). My earlier change to this PR restricting URIs to URLs to make the identifiers more useful (i.e. self-documenting) was unintentionally breaking. (I had wanted to get back to URIs with a later phase of ZEP9, but of course that doesn't fix it.)

Reading through Ryan's PR, I did wonder if there wasn't a URN (as a specific type of URI) that we could use. The closest I could find would be the "eXperimental" prefix: urn:x-. It's technically deprecated. Section https://www.rfc-editor.org/rfc/rfc6648#page-4 is interesting though in that it says protocol designers "SHOULD NOT disallow x-" (👍🏽) but "SHOULD define simple, clear registration procedures" (👍).

"Raw"/Registered names could be considered (or renamed to) "shortcuts" for "urn:zarr*" [1] and the non-registered names could be definitively prefixed with urn:x-, e.g. urn:x-zarr: or urn:x-YOURORG:.

Going this route would mean all URIs are again permissible but discouraged. urn:uuid: would even be an option, but we'd clearly recommend that authors SHOULD use one of the simpler forms.

This runs the risk of not always fulfilling @d-v-b's ask for a clear code-compatible label (or "short name" as @jhamman described it from cfconventions) but might balance some of the other priorities that have been expressed above.

[1] Here I use zarr* to mean we might want/need urn:zarr-dtype: etc. We shouldn't use urn:zarr* until it can be registered & approved. This work could conceivably also go together with an IANA registration of one or more mimetypes (#123) which I've also recently been asked about.
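
To make the "shortcut" idea above concrete, here is a minimal sketch of how a name might be expanded to a URN under that proposal. Everything in it is an assumption for illustration: the `REGISTERED_NAMES` set stands in for the zarr-extensions registry, `to_urn` is a hypothetical helper, and the `urn:zarr:` namespace is not registered, so this is not something an implementation should emit today.

```python
from urllib.parse import urlparse

# Stand-in for the zarr-extensions registry (hypothetical contents).
REGISTERED_NAMES = {"gzip", "blosc"}


def to_urn(name: str, org: str = "zarr") -> str:
    """Expand an extension name to a URN under the proposal sketched above.

    - Names that already carry a URI scheme (urn:, https:, ...) are returned as-is.
    - Registered ("raw") names become shortcuts for urn:zarr:<name>.
    - Anything else gets the experimental prefix urn:x-<org>:<name>.
    """
    if urlparse(name).scheme:
        return name
    if name in REGISTERED_NAMES:
        return f"urn:zarr:{name}"
    return f"urn:x-{org}:{name}"


registered = to_urn("gzip")
experimental = to_urn("my_codec", org="myorg")
already_uri = to_urn("https://example.org/codecs/my_codec")
```

Under this reading, every identifier is a URI again, while the short registered form stays available as the code-friendly label.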

@d-v-b
Contributor

d-v-b commented Mar 7, 2025

if we think the number of affected codecs is small, I would opt for @LDeakin's suggestion of adding a note that the use of a URI was once encouraged and is now discouraged, and otherwise going with the direction of @rabernat's changes.
