Representation of and operations on pointers and usize #251

Closed

Labels

C-support

mahkoh

I have a few questions regarding usize and raw pointers for sized T:

Is usize guaranteed to have the same layout as one of the uN?
Is usize guaranteed to have no padding bits?
Is usize as *mut T as usize the identity function on usize?
Is usize as *mut T guaranteed to be the same as transmute::<usize, *mut T>?
Is ptr::read_unaligned::<*mut T> safe for all arguments for which ptr::read_unaligned::<[u8; sizeof(*mut T)]> is safe?
Is usize-literal as *mut T guaranteed to have no special behavior? (e.g. for 0usize)

The current documentation does not answer these questions afaict.

RalfJung

Member

Is usize guaranteed to have no padding bits?

Yes, all our integer types are.
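As a rough illustration of what "no padding bits" means in practice, the sketch below checks that the number of value bits in usize equals the number of bits in its storage. The helper name `usize_has_no_padding_bits` is made up for this example, and the check only observes the current target; it is not a statement of the guarantee itself.

```rust
use std::mem::size_of;

/// Returns true if every bit of `usize`'s storage is a value bit.
/// (Hypothetical helper, for illustration only.)
fn usize_has_no_padding_bits() -> bool {
    let storage_bits = size_of::<usize>() * 8;
    // `usize::BITS` counts value bits; `usize::MAX` must set all of them.
    usize::BITS as usize == storage_bits && usize::MAX.count_ones() == usize::BITS
}

fn main() {
    assert!(usize_has_no_padding_bits());
    println!(
        "usize: {} value bits stored in {} bytes",
        usize::BITS,
        size_of::<usize>()
    );
}
```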

Is usize as *mut T as usize the identity function on usize?
Is usize as *mut T guaranteed to be the same as transmute::<usize, *mut T>?

These are contingent on details of pointer provenance and how it interacts with integer-pointer casts... so the answer is "we don't know". FWIW, that is also the answer for LLVM IR, and likewise in C/C++. (See this for some of the recent work on the C/C++ side, but I hope we can find something cleaner for Rust.)

The opposite direction, *mut T as usize, will most likely not be equivalent to a transmute -- in fact so far there is no good proposal on the table for how to make the transmute not UB, which is rather unsatisfying. (This paper is one of the most recent proposals.)
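A minimal sketch of the round trip being asked about, assuming a current rustc on a mainstream target. It only observes that the address survives the two casts. It says nothing about provenance or about whether the casts are equivalent to a transmute, which is the part that remains open. The function name `round_trip` is hypothetical.

```rust
/// Casts an integer to a raw pointer and back again.
/// The pointer is never dereferenced.
fn round_trip(addr: usize) -> usize {
    let p = addr as *mut u8; // integer-to-pointer cast
    p as usize // pointer-to-integer cast
}

fn main() {
    for addr in [0usize, 1, 0x1000, usize::MAX] {
        // On current implementations the address is preserved.
        assert_eq!(round_trip(addr), addr);
    }
}
```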

Is ptr::read_unaligned::<*mut T> safe for all arguments for which ptr::read_unaligned::<[u8; sizeof(*mut T)]> is safe?

No, certainly not. For example, if T := bool, then the former is UB if the value you are loading is not a valid bool.

Is usize-literal as *mut T guaranteed to have no special behavior? (e.g. for 0usize)

Yes.
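To make the "no special behavior" answer concrete, here is a small sketch comparing a literal zero with a zero that the optimizer cannot see through. The helper names are invented for this example; `std::hint::black_box` is used only to keep the second zero opaque at compile time.

```rust
use std::hint::black_box;
use std::ptr;

/// Pointer produced from the literal `0usize`.
fn literal_zero() -> *mut u8 {
    0usize as *mut u8
}

/// Pointer produced from a zero the optimizer is asked not to see through.
fn runtime_zero() -> *mut u8 {
    let z: usize = black_box(0);
    z as *mut u8
}

fn main() {
    // Both casts yield the same pointer value, and it is the null pointer.
    assert_eq!(literal_zero(), runtime_zero());
    assert!(literal_zero().is_null());
    assert_eq!(ptr::null_mut::<u8>() as usize, 0);
}
```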

mahkoh

Author

Thanks for the links, I'll look at them later.

Is ptr::read_unaligned::<*mut T> safe for all arguments for which ptr::read_unaligned::<[u8; sizeof(*mut T)]> is safe?

No, certainly not. For example, if T := bool, then the former is UB if the value you are loading is not a valid bool.

ptr::read_unaligned::<*mut T> produces a *mut T not a T.

IIRC, pointers actually have trap representations on certain architectures in C. Since raw pointers in Rust are somewhat relaxed compared to C pointers, can we guarantee that any bit pattern is a valid raw pointer in Rust?

RalfJung

Member

ptr::read_unaligned::<*mut T> produces a *mut T not a T.

Oh sorry, I misread. So regarding your question then, I think the answer is yes -- everything that's a valid integer will also be a valid pointer. The other way around might not be true though, that is again about pointer provenance.

But really the proper answer is that Rust doesn't have a spec yet that can answer such detailed questions, sorry. :/

Since raw pointers in Rust are somewhat relaxed compared to C pointers, can we guarantee that any bit pattern is a valid raw pointer in Rust?

No -- see #71 for the related discussions on validity of integers. ("Trap representations" are not a thing in Rust, but we have a related concept of "invalid values".) If uninitialized integers are invalid, then uninitialized raw pointers will also be invalid.
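The read_unaligned question can be sketched as follows, assuming the bytes are fully initialized (the uninitialized case is exactly the one the answer above says may be invalid). The function name `read_ptr_from_bytes` is hypothetical, and the resulting pointer is only inspected, never dereferenced.

```rust
use std::mem::size_of;
use std::ptr;

/// Reads a raw pointer out of an initialized, possibly unaligned byte buffer.
fn read_ptr_from_bytes(bytes: &[u8; size_of::<*mut u8>()]) -> *mut u8 {
    // SAFETY: `bytes` is valid for reads of `size_of::<*mut u8>()` bytes and
    // fully initialized; `read_unaligned` has no alignment requirement.
    unsafe { ptr::read_unaligned(bytes.as_ptr() as *const *mut u8) }
}

fn main() {
    let addr: usize = 0x1234;
    let bytes = addr.to_ne_bytes();
    let p = read_ptr_from_bytes(&bytes);
    // The pointer carries the same address as the integer the bytes came from.
    assert_eq!(p as usize, addr);
}
```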

digama0

The opposite direction, *mut T as usize, will most likely not be equivalent to a transmute -- in fact so far there is no good proposal on the table for how to make the transmute not UB, which is rather unsatisfying. (This paper is one of the most recent proposals.)

Forgive my naivete, but what are the blockers behind both directions being trivial? If raw pointers are just treated as integers, then these casts become trivial to specify, and all the interesting work goes into &mut T -> *mut T and *mut T -> &mut T instead.

RalfJung

Member

Pointers in LLVM are not integers, so we cannot make them integers in Rust either. Potentially we could compile raw pointers differently, not using LLVM pointers, but I am not sure how well that works (and LLVM semantics in this area are so unclear, not to say buggy, that there is no telling if that would actually help).

RalfJung

Member

Oh also, Rust functions like offset and wrapping_offset clearly show that the intent in Rust is for raw pointers to carry provenance. The goal, as far as I understand it, is that raw pointers in Rust are like pointers in C -- and those are complicated.

I like the idea of only references having provenance, but I think it is unfortunately unrealistic. It would be good to have more data on this though, like numbers for what the perf impact would be if we used LLVM integers to compile Rust raw pointers. Also see these provenance-related discussions in the UCG.
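A short sketch of why these functions suggest provenance: `wrapping_add` may compute an address outside the allocation, but the result still "remembers" the allocation it was derived from, so it can be dereferenced once it is brought back in bounds. The function names below are made up for illustration.

```rust
/// Reaches `arr[2]` by leaving the allocation and coming back.
fn third_via_detour(arr: &mut [u8; 4]) -> u8 {
    let base = arr.as_mut_ptr();
    // Out of bounds, which is allowed for `wrapping_add` as long as the
    // pointer is not dereferenced while it is outside the allocation.
    let far = base.wrapping_add(100);
    let back = far.wrapping_sub(98); // now points at arr[2]
    // SAFETY: `back` is in bounds of the allocation `base` was derived from.
    unsafe { *back }
}

/// Reaches `arr[2]` with `add`, which requires staying in bounds throughout.
fn third_direct(arr: &mut [u8; 4]) -> u8 {
    // SAFETY: offset 2 is within the 4-element array.
    unsafe { *arr.as_mut_ptr().add(2) }
}

fn main() {
    let mut arr = [10u8, 20, 30, 40];
    assert_eq!(third_via_detour(&mut arr), 30);
    assert_eq!(third_direct(&mut arr), 30);
}
```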

digama0

I will grant that pointers are complicated, but it seems that pointer provenance already gets wrapped up in discussions about "what's in a byte" even when only talking about integer types. So it seems like we can just make raw pointers exactly as "provenant" as usize and then the cast is again trivial, notwithstanding the unsolved problem of exactly how much provenance to carry around through both raw pointers and integers.

RalfJung

Member

I am working under the assumption that usize does not have provenance. This assumption is deeply wired into many optimizations, e.g. when using if x == y { ... as indicator that we can replace x with y inside the conditional. (Even if they are equal integers, their provenance might differ.) Moreover, you lose some arithmetic properties, depending on how exactly provenance propagates through arithmetic -- you lose associativity and/or commutativity, x * 0 is not the same as 0, ...

The big advantage of pointers is that their only "arithmetic" is "add an integer", so it is easy to say what happens to provenance: the result has the same provenance as the left input (the only pointer input). But provenance needs to end when pointers are cast to integers, and something needs to be done about pointers being transmuted to integers. The easiest option would be to declare it UB, and C's strict aliasing even almost does that. It could also behave like uninitialized integers, but it is unclear if that is any better.
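The "equal integers, different provenance" situation can be sketched like this. Whether the two arrays end up adjacent is entirely up to the compiler, so the example makes no claim about which branch runs; `same_address` is a hypothetical helper.

```rust
/// Compares two pointers by address only, ignoring where they came from.
fn same_address(p: *const u8, q: *const u8) -> bool {
    p as usize == q as usize
}

fn main() {
    let a = [1u8; 4];
    let b = [2u8; 4];
    // One past the end of `a`: derived from `a`, must not be dereferenced.
    let end_of_a = a.as_ptr().wrapping_add(a.len());
    // Start of `b`: derived from `b`, fine to read through.
    let start_of_b = b.as_ptr();

    if same_address(end_of_a, start_of_b) {
        // Same integer, but only `start_of_b` may be used to access `b`.
        println!("a and b happen to be adjacent");
    } else {
        println!("a and b are not adjacent in this build");
    }

    // SAFETY: `start_of_b` points at the first element of `b`.
    assert_eq!(unsafe { *start_of_b }, 2);
}
```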

This C++ proposal has more detail; and here is even more material.

So it seems like we can just

Quite a few people have thought about these problems for years; there does not seem to be an easy solution that we can "just" use. ;)

digama0

What is the reason that all this mess has to be in the spec instead of just in the compiler though? SB already gives a semantics that tracks where you are or are not allowed to write, and once you are in raw pointer land most of the value tracking turns off. If the compiler can infer some provenance information, great, but the spec doesn't need to play that game.

Put another way, what is a program that should be UB that SB says is fine, and relies on tracking pointer provenance through raw pointers and/or integers?

This assumption is deeply wired into many optimizations, e.g. when using if x == y { ... as indicator that we can replace x with y inside the conditional.

Isn't there an LLVM bug (that you filed, I think) that does this with pointers?

sollyucko

Is usize guaranteed to have the same layout as one of the uN?

Cases where this might not be true include:

eZ80 in ADL mode (24-bit registers and addresses)
Possibly segmented x86 (AFAICT 16 bit registers and 21 bit addresses (unless represented as 32 bits using a segment value and offset)?)
Motorola 56000 (24/48/56-bit registers, 16-bit addresses?)

digama0

I would hope that we can at least ensure that it has the same layout pattern as uN where N is the number of bits in the word; that is, if the word size is 24 bits then usize = u24, if such a thing existed. I suppose the only non-obvious property is alignment, but that probably has to be specified by the target architecture in any case.
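On the targets rustc supports today, this can at least be observed empirically. The sketch below checks whether usize has the same size and alignment as one of u16, u32, or u64 on the current target. The helper name is invented for this example, and a passing check is an observation about one target, not a language guarantee.

```rust
use std::mem::{align_of, size_of};

/// Returns true if `usize` has the size and alignment of `u16`, `u32`, or `u64`
/// on the target this is compiled for.
fn usize_matches_fixed_width() -> bool {
    let layout = (size_of::<usize>(), align_of::<usize>());
    layout == (size_of::<u16>(), align_of::<u16>())
        || layout == (size_of::<u32>(), align_of::<u32>())
        || layout == (size_of::<u64>(), align_of::<u64>())
}

fn main() {
    assert!(usize_matches_fixed_width());
    // A raw pointer to a sized type has the same size as usize.
    assert_eq!(size_of::<usize>(), size_of::<*mut u8>());
}
```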

RalfJung

Member

What is the reason that all this mess has to be in the spec instead of just in the compiler though?

Because so far nobody has been able to propose a semantics that can hide this from the spec, but still do all the desired optimizations and at the same time provide pointer-integer casts.

If the compiler can infer some provenance information, great, but the spec doesn't need to play that game.

It does, though. Stacked Borrows relies on this provenance information, and that effect cannot be entirely hidden from the semantics in a language with pointer-integer casts.

Put another way, what is a program that should be UB that SB says is fine, and relies on tracking pointer provenance through raw pointers and/or integers?

Stacked Borrows "gives up" on raw pointers (treating them as having much less provenance -- but there is still some provenance left, namely to track the allocation ID that the pointer originally pointed to), but that is intended as a temporary stepping-stone. #86 contains some examples of optimizations that we want to do, and that LLVM does, but that SB currently fails to support because it forgets provenance for raw pointers.

I already mentioned offset as a function whose UB can only be modeled if raw pointers have provenance.

This assumption is deeply wired into many optimizations, e.g. when using if x == y { ... as indicator that we can replace x with y inside the conditional.

Isn't there an LLVM bug (that you filed, I think) that does this with pointers?

Do you mean this one? Yes, LLVM does this with pointers and that's wrong. The goal is to make it wrong for pointers but right for integers.

mahkoh

Author

@RalfJung

The goal, as far as I understand it, is that raw pointers in Rust are like pointers in C

I do not believe so. I've taken a quick look at annex J (unspecified/undefined/implementation-defined behavior) of C11:

| Operation | C Pointer | Rust raw pointer | Rust reference |
| --- | --- | --- | --- |
| The value of a pointer to an object whose lifetime has ended is used | ❌ | 👍 | ❌ |
| Conversion between two pointer types produces a result that is incorrectly aligned | ❌ | 👍 | ❌ |
| Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that does not point into, or just beyond, the same array object | ❌ | 👍 (wrapping_add) | n/a |
| Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond the array object and is used as the operand of a unary * operator that is evaluated | ❌ | ~~👍 (wrapping_add, if the pointer is valid)~~ ❌ | n/a |
| Pointers that do not point to the same aggregate or union (nor just beyond the same array object) are compared using relational operators | ❌ | 👍 | n/a |
| The value of a pointer that refers to space deallocated by a call to the free or realloc function is used | ❌ | 👍 | ❌ |
| An object has its stored value accessed other than by an lvalue of an allowable type (strict aliasing) | ❌ | 👍 | 👍 |
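A few of the rows marked 👍 for raw pointers can be sketched as follows: using the value (not the pointee) of a dangling pointer, creating a misaligned pointer, and comparing pointers to unrelated objects. None of the pointers is dereferenced. The type `Aligned` and the function names are hypothetical, chosen only for this example.

```rust
/// A buffer with known 4-byte alignment, so that offset 1 is misaligned for u32.
#[repr(align(4))]
struct Aligned([u8; 8]);

/// Uses the value of a pointer whose pointee has already been freed.
fn dangling_address() -> usize {
    let dangling: *const u32 = {
        let b = Box::new(7u32);
        let p: *const u32 = &*b;
        p
    }; // `b` is dropped here; `dangling` now points to freed memory
    dangling as usize // inspecting the address is fine, reading through it is not
}

/// Creates (but does not use) a misaligned `*const u32`.
fn misaligned_u32_ptr(buf: &Aligned) -> *const u32 {
    buf.0.as_ptr().wrapping_add(1) as *const u32
}

fn main() {
    assert_ne!(dangling_address(), 0);

    let buf = Aligned([0; 8]);
    assert_eq!((misaligned_u32_ptr(&buf) as usize) % 4, 1);

    // Relational comparison of pointers to two unrelated objects.
    let (x, y) = (1u8, 2u8);
    let (px, py) = (&x as *const u8, &y as *const u8);
    assert!(px < py || py < px);
}
```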

People who want to write unsafe code often consult the C standard to see what works and what doesn't work. I believe it would be good to have an informal document that goes through the C standard and compares C semantics to the current Rust semantics.

Besides the differences regarding pointers, another example is unions, which do not have an active field in Rust and can therefore be used to implement transmute.
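A minimal sketch of the union point, assuming a `#[repr(C)]` union of two same-sized fields. `FloatBits` and `f32_to_bits_via_union` are names invented for this example; the result is compared against the standard `f32::to_bits`.

```rust
/// Two views of the same four bytes. Rust unions have no "active field",
/// so writing one field and reading the other reinterprets the bytes.
#[repr(C)]
union FloatBits {
    f: f32,
    bits: u32,
}

/// Reinterprets an `f32` as its bit pattern, much like `transmute::<f32, u32>`.
fn f32_to_bits_via_union(f: f32) -> u32 {
    let u = FloatBits { f };
    // SAFETY: both fields are 4 bytes and every bit pattern is a valid `u32`.
    unsafe { u.bits }
}

fn main() {
    assert_eq!(f32_to_bits_via_union(1.0), 1.0f32.to_bits());
    assert_eq!(f32_to_bits_via_union(1.0), 0x3f80_0000);
}
```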

PS: I haven't yet had time to look at your links.

RalfJung

Member

wrapping_add is just an operation that Rust has that C could have but doesn't -- we also have add/offset corresponding to the usual C-style rules for pointer arithmetic. But I take your other points. Rust raw pointers are more relaxed than C's.

The fact remains that both offset and wrapping_offset just make no sense for pointers without provenance. The only reason these operations exist is because they preserve provenance, and the people introducing them considered it important that LLVM has that information available for its optimizations. That is why I am working under the assumption that raw pointers should have provenance.

Note that raw pointers not having provenance does not really solve any of the hard problems; it just moves them around. Everything that is currently tricky about casting and transmuting between raw pointers and integers then becomes tricky about casting and transmuting between raw pointers and references. We have to figure out answers to these questions anyway; we have to find some good way to handle the "boundary" between what has provenance and what does not. Where we put that boundary is mostly orthogonal.

25 remaining items


Activity

RalfJung commented on Oct 11, 2020

mahkoh commented on Oct 11, 2020

RalfJung commented on Oct 11, 2020

digama0 commented on Oct 11, 2020

RalfJung commented on Oct 11, 2020

RalfJung commented on Oct 11, 2020

digama0 commented on Oct 11, 2020

RalfJung commented on Oct 11, 2020

digama0 commented on Oct 11, 2020

sollyucko commented on Oct 11, 2020

digama0 commented on Oct 11, 2020

RalfJung commented on Oct 11, 2020

mahkoh commented on Oct 13, 2020

RalfJung commented on Oct 13, 2020
