Add lower-precision integer and floating point data types, and packbits codec #3

Open

@jbms jbms commented Mar 4, 2025

I intend to also register all of the other data types listed here:

https://pypi.org/project/ml-dtypes/

8-bit floating point representations, parameterized by number of exponent and mantissa bits, as well as the bias (if any) and representability of infinity, NaN, and signed zero.

float8_e3m4
float8_e4m3
float8_e4m3b11fnuz
float8_e4m3fn
float8_e4m3fnuz
float8_e5m2
float8_e5m2fnuz
float8_e8m0fnu
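
As a rough illustration of what "parameterized by number of exponent and mantissa bits" means, the sketch below decodes a single byte under plain IEEE 754 conventions (sign bit, biased exponent, subnormals, infinity and NaN at the all-ones exponent). The function name `decode_ieee_like` is hypothetical, and the sketch assumes the conventional bias of 2^(exp_bits - 1) - 1, which matches a layout such as float8_e5m2. It deliberately does not handle the `fn`, `fnuz`, `b11`, or `e8m0fnu` variants, which differ in bias and in how infinity, NaN, and signed zero are represented.

```python
import math


def decode_ieee_like(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one 8-bit float laid out as sign | exponent | mantissa.

    Assumes IEEE 754 conventions and the conventional bias; this is an
    illustrative helper, not part of any specification.
    """
    if 1 + exp_bits + man_bits != 8:
        raise ValueError("sign + exponent + mantissa bits must total 8")
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)

    if exp == (1 << exp_bits) - 1:
        # All-ones exponent: infinity if the mantissa is zero, else NaN.
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:
        # Zero exponent: signed zero or a subnormal (no implicit leading 1).
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal number with an implicit leading 1.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


# Example: interpret a few bytes as 1 sign, 5 exponent, 2 mantissa bits.
for b in (0x3C, 0xC0, 0x7B, 0x01):
    print(hex(b), decode_ieee_like(b, exp_bits=5, man_bits=2))
```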

Microscaling (MX) sub-byte floating point representations:

float4_e2m1fn
float6_e2m3fn
float6_e3m2fn

Narrow integer encodings:

int2
int4
uint2
uint4

Potentially I'll add the others to this PR, but I wanted to make sure the bfloat16 README was in order first.

There are a few questions I have:

  • Should they all be specified as independent documents, or should some be combined to a single document somehow?
  • Should a trivial schema just listing the data type name be provided?
  • I have a link to the main spec, which unfortunately includes "v3.0" which presumably will become stale at some point.
  • Data types interact with other things in the spec in a few ways:
    • Fill values
    • Codecs

Currently, for the core data types, we specify the fill value representation under the fill_value section, separately from where the data types themselves are defined.

We have to specify how the data type is handled by each codec that supports it. In the case of bfloat16 it is only the bytes codec. The bytes codec description itself specifies how it handles all of the core data types, but presumably we would instead specify that as part of the extension data type specification.

In the case of bfloat16, the bytes encoding is so obvious that it hardly requires any explanation at all. For the other data types listed above that are smaller than 1 byte, however, we have to say that each element is padded to 1 byte, with the high bits ignored. Additionally, I may want to register a pack_bits codec in the future that would support bool as well as these other sub-byte data types.
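
To make the difference between the padded and packed layouts concrete, here is a minimal sketch for 4-bit integers. All three helper names are hypothetical. The padded form stores one element per byte and ignores the high bits on read, as described above. The packed form is only one possible layout: it assumes elements are packed least-significant bits first, which is an assumption made for illustration and may not match the bit order or padding rules that a registered codec would define.

```python
def encode_padded(values, bits):
    """Store one element per byte; the unused high bits are written as zero."""
    mask = (1 << bits) - 1
    return bytes(v & mask for v in values)


def decode_padded(data, bits, signed):
    """Read one element per byte, ignoring the high (8 - bits) bits."""
    mask = (1 << bits) - 1
    out = []
    for b in data:
        v = b & mask
        if signed and v >> (bits - 1):
            # Sign-extend from the top bit of the narrow type.
            v -= 1 << bits
        out.append(v)
    return out


def pack_bits_lsb_first(values, bits):
    """Pack elements densely, least-significant bits first (assumed order)."""
    mask = (1 << bits) - 1
    acc = 0
    for i, v in enumerate(values):
        acc |= (v & mask) << (i * bits)
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")


values = [-8, -1, 0, 7]                  # int4 range is -8..7
padded = encode_padded(values, bits=4)   # 4 bytes, one per element
packed = pack_bits_lsb_first(values, 4)  # 2 bytes, two elements per byte
print(padded.hex(), packed.hex())
print(decode_padded(padded, bits=4, signed=True))
```

The padded form is simpler to index but uses twice the space for 4-bit types and four times the space for 2-bit types, which is the motivation for a separate packing codec.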

If I am the maintainer of all of the relevant extensions then there is no issue since I can modify them to reference each other as needed, but if I were not, it is less clear how we would deal with this.

@normanrz
Member

normanrz commented Mar 4, 2025

> There are a few questions I have:
>
> Should they all be specified as independent documents, or should some be combined to a single document somehow?

The idea was to have one folder+readme per dtype. We have to see how well that scales over time.

> Should a trivial schema just listing the data type name be provided?

That would be awesome. Strictly speaking, I think an object notation would also be valid:

{"name":"bfloat16"}

even with an empty configuration

{"name":"bfloat16", "configuration": {}}
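
A reader that wants to accept all of these spellings could normalize them to one form before comparing. The sketch below is a hypothetical helper (the name `normalize_data_type` is not from any library) and assumes, following the comment above, that a bare string, an object with only `name`, and an object with an empty `configuration` are all equivalent.

```python
def normalize_data_type(meta):
    """Return the data type metadata as {"name": ..., "configuration": {...}}.

    Accepts a bare string or an object with a "name" key and an optional
    "configuration" object. Illustrative only.
    """
    if isinstance(meta, str):
        return {"name": meta, "configuration": {}}
    if isinstance(meta, dict) and isinstance(meta.get("name"), str):
        config = meta.get("configuration", {})
        if not isinstance(config, dict):
            raise ValueError("configuration must be an object")
        return {"name": meta["name"], "configuration": dict(config)}
    raise ValueError(f"unrecognized data type metadata: {meta!r}")


forms = [
    "bfloat16",
    {"name": "bfloat16"},
    {"name": "bfloat16", "configuration": {}},
]
print([normalize_data_type(f) for f in forms])
```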

> I have a link to the main spec, which unfortunately includes "v3.0" which presumably will become stale at some point.

@joshmoore What do you think about that?

> Data types interact with other things in the spec in a few ways:
>   • Fill values
>   • Codecs

Dtypes should define the acceptable values for their fill values.
The interaction with codecs needs a bit more spec work. We probably need to expect the bytes codec to be expanded to extension dtypes.

@jbms
Author

jbms commented Mar 4, 2025

For now I can just specify in the data type specification how it interacts with the bytes codec, and then later update it when/if the pack_bits codec is added.

In principle we could have the situation where one person adds the int4 data type, another person later adds the pack_bits codec but only mentions bool and not int4, and then a third person wants to make pack_bits work with int4.

@jbms jbms changed the title from "bfloat16 data type" to "Add lower-precision integer and floating point data types, and packbits codec" on Mar 5, 2025
@jbms
Author

jbms commented Mar 5, 2025

I added all of the other data types, and also added the packbits codec.
