adds codecs that numcodecs defines #2

Open
wants to merge 4 commits into
base: main

Conversation

normanrz
Member

@normanrz normanrz commented Feb 24, 2025

  • Blosc
  • LZ4
  • Zstd
  • Zlib
  • GZip
  • BZ2
  • LZMA
  • Shuffle
  • CRC32
  • CRC32C
  • Adler32
  • Fletcher32
  • JenkinsLookup3
  • PCodec
  • ZFPY

@normanrz
Member Author

normanrz commented Mar 1, 2025

I validated the schema.json files against the numcodecs fixtures:

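# /// script
# dependencies = [ "jsonschema" ]
# ///

import json
from pathlib import Path

from jsonschema import validate

# Local checkout of numcodecs; its fixture directory holds one folder per codec.
numcodecs_fixture_path = Path.home() / "numcodecs" / "fixture"

for path in Path("codecs").glob("numcodecs.*/schema.json"):
    # Directory names look like "numcodecs.<name>"; keep only the codec name.
    _, name = path.parent.name.split(".", 1)
    print(name)
    # Load each schema once rather than once per fixture.
    schema = json.loads(path.read_text())
    for fixture_path in (numcodecs_fixture_path / name).glob("**/config.json"):
        print("  ", fixture_path)
        config_json = json.loads(fixture_path.read_text())
        # Fixtures carry the numcodecs "id" key; the schemas expect name + configuration.
        config_json.pop("id", None)
        config_json = {"name": f"numcodecs.{name}", "configuration": config_json}

        # Raises jsonschema.ValidationError on the first mismatch.
        validate(instance=config_json, schema=schema)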
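
For readers who want to see what the script actually feeds to `validate`, here is a minimal sketch of the wrapping step on its own. The helper name `wrap_fixture_config` and the sample `zlib` config are hypothetical and chosen only for illustration; real fixture contents may differ.

```python
import json


def wrap_fixture_config(name: str, fixture_config: dict) -> dict:
    """Wrap a numcodecs-style config in the name/configuration layout.

    Hypothetical helper mirroring the loop body of the script above.
    """
    config = dict(fixture_config)  # copy so the caller's dict is untouched
    config.pop("id", None)         # drop the numcodecs "id" key if present
    return {"name": f"numcodecs.{name}", "configuration": config}


# Assumed sample input, not taken from an actual fixture file.
sample = {"id": "zlib", "level": 1}
print(json.dumps(wrap_fixture_config("zlib", sample)))
# prints {"name": "numcodecs.zlib", "configuration": {"level": 1}}
```

The codec name comes from the schema's directory name rather than from the fixture's `id`, which matches how the script derives it.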

@jbms

jbms commented Mar 5, 2025

Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.

I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

@normanrz
Member Author

normanrz commented Mar 5, 2025

> Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Well, right now numcodecs uses the `numcodecs.` prefix in the codec names. Also, I am not sure the metadata is 100% identical to what is listed in zarr-specs. That is why they are duplicated.
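
To make the naming difference concrete, here is a rough sketch comparing two codec entries that differ only by the prefix. Both dictionaries are illustrative assumptions that follow the `name`/`configuration` layout from the validation script earlier in this thread; they are not copied from either specification, and `strip_prefix` and `same_codec` are hypothetical helpers.

```python
PREFIX = "numcodecs."

# Illustrative entries only; real metadata may carry different keys or values.
prefixed = {"name": "numcodecs.gzip", "configuration": {"level": 5}}
unprefixed = {"name": "gzip", "configuration": {"level": 5}}


def strip_prefix(name: str) -> str:
    """Return the codec name without a leading "numcodecs." prefix."""
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


def same_codec(a: dict, b: dict) -> bool:
    """True when two entries match once the prefix is ignored."""
    return (
        strip_prefix(a["name"]) == strip_prefix(b["name"])
        and a.get("configuration", {}) == b.get("configuration", {})
    )


print(prefixed == unprefixed)            # False: the names differ
print(same_codec(prefixed, unprefixed))  # True: same settings under either name
```

A check like this only compares the metadata dictionaries. It says nothing about whether the encoded bytes are identical, which is the part I am unsure about.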

> Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.
>
> I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

I agree and would welcome contributions. Unfortunately, the numcodecs documentation is also pretty sparse on encoding details, so for every codec we need to go through the code and write a spec.
Writing a specification is strongly encouraged but not required. In the interest of time, I wanted to get these specification scaffolds merged to reserve the names and leave the spec details for later.

@jbms

jbms commented Mar 5, 2025

> > Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?
>
> Well, right now numcodecs uses the `numcodecs.` prefix in the codec names. Also, I am not sure the metadata is 100% identical to what is listed in zarr-specs. That is why they are duplicated.

I see --- I did not realize that zarr-python had added all of the numcodecs codecs for zarr v3 as numcodecs.xxx.

I imagine it was done to make it very easy for someone using zarr-python to migrate to using zarr v3 -- which is understandable.

However, from an interoperability perspective this is kind of unfortunate: someone using zarr-python with zarr v3 and a numcodecs.xxx codec may not realize that they are producing a zarr array that is not interoperable with any other zarr implementation, because the codec gets recorded as numcodecs.xxx. That is particularly unfortunate for codecs like gzip, blosc, or zstd, which other implementations do support with both zarr v2 and zarr v3. Had the zarr-python user specified the codec in exactly the same way but used zarr v2, they would have produced an interoperable array; by specifying zarr v3 they produce a non-interoperable one.
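
One way a user could notice this before sharing data is to scan the array metadata for prefixed codec names. The following is only a sketch: it assumes the metadata has a top-level `codecs` list of entries with a `name` key, the sample document is made up for illustration, and `find_prefixed_codecs` is a hypothetical helper rather than part of any zarr implementation.

```python
def find_prefixed_codecs(array_metadata: dict, prefix: str = "numcodecs.") -> list:
    """Return the names of codecs in the metadata that start with the prefix.

    Only the top-level "codecs" list is inspected; nested codec chains
    are not followed in this sketch.
    """
    names = []
    for codec in array_metadata.get("codecs", []):
        name = codec.get("name", "")
        if name.startswith(prefix):
            names.append(name)
    return names


# Made-up metadata for illustration; not taken from a real array.
metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "numcodecs.gzip", "configuration": {"level": 5}},
    ],
}

flagged = find_prefixed_codecs(metadata)
if flagged:
    print("codecs other implementations may not recognize:", flagged)
```

An empty result does not prove the array is portable. It only means none of the top-level codec names carry the prefix.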
