- Feature Name: `pointer-match`
- Start Date: 2019-03-21
- RFC PR: rust-lang/rfcs#0000
- Rust Issue: rust-lang/rust#0000
Extend match syntax and patterns with support for a limited set of pointer
operations that involve only address calculation and never read through the
pointer value. Make it possible to use these matches to calculate addresses of
fields even for `repr(packed)` structs and possibly unaligned pointers, where
an intermediate reference must not be created.
To create a pointer to a field of a struct, there is currently no way in Rust that avoids creating a temporary reference. Since reference semantics are stricter, this may lead to silent undefined behaviour where that reference should not be valid. Depending on the resolution of reference semantics this affects:
- Creating a pointer to a field of a packed struct, where the reference may be unaligned (depending on .
- Pointing to fields of an uninitialized type, where the reference points to uninitialized data. This may be complicated by unions, where it could be possible that not a single variant is currently completely initialized, yet one wants to access some subfield. See rust-lang/unsafe-code-guidelines#73 (comment).
- Doing pointer offset calculations where the reference does not refer to the same, or any, allocation. This is because reference calculations are performed with `getelementptr inbounds`.
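To make the first point concrete, the following sketch measures where a packed field actually lands. It is written in current stable Rust and assumes `core::ptr::addr_of!`, which is not part of this proposal. `repr(C, packed)` is used instead of plain `repr(packed)` only so that the field order is fixed and the offset is predictable.

```rust
use core::ptr::addr_of;

// `C` is added here only to pin the field order for the illustration.
#[repr(C, packed)]
struct Foo {
    a: u16,
    b: u32,
}

/// Byte offset of `b` from the start of `foo`, computed from raw addresses
/// without ever forming a `&u32` to the field.
fn offset_of_b(foo: &Foo) -> usize {
    let base = foo as *const Foo as usize;
    let field = addr_of!(foo.b) as usize;
    field - base
}
```

With this layout the field sits two bytes into the struct, which is not a multiple of the usual 4-byte alignment of `u32`, so a `&foo.b` reference could not be assumed to be aligned.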
Match expressions are extended from supporting a reference binding mode to
also supporting a pointer binding mode. Furthermore, a new pattern binds to a
pointer, and identifiers are extended to allow a new mode similar to `ref` and
`ref mut` binding to a reference. These patterns are called pointer pattern and
raw identifier for the remainder of the document.
#[repr(packed)]
struct Foo {
a: u16,
b: u32,
}
fn ptr_b(foo: &mut Foo) -> *mut u32 {
let raw mut b = foo.b;
b
}
Note that pointer binding mode and pointer patterns require `unsafe`, even when
the match never dereferences the pointer, because the arithmetic on the pointer
may implicitly overflow. Furthermore, not all patterns are (yet) allowed, to
avoid implicitly performing an unintended, unsafe read through the pointer.
Pointer binding mode will at first only permit ultimately binding with `raw`
and `ref`, and not actually reading the contained memory.
The raw identifier pattern does not require `unsafe` on its own (as seen above,
where we safely match a `&mut` but bind to `*mut`).
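The `let raw mut` syntax above is proposed and does not compile today. As a rough point of comparison, here is a sketch of the same function in current stable Rust. It assumes the `core::ptr::addr_of_mut!` macro, which is not part of this proposal, and adds a hypothetical helper `set_b` to show that only the write needs `unsafe`.

```rust
use core::ptr::{self, addr_of_mut};

// `C` is added only to make the layout predictable for the example.
#[repr(C, packed)]
struct Foo {
    a: u16,
    b: u32,
}

/// Safe: computes the field address without creating a `&mut u32`.
fn ptr_b(foo: &mut Foo) -> *mut u32 {
    addr_of_mut!(foo.b)
}

/// Hypothetical helper: the write goes through a raw, possibly unaligned
/// pointer, so it needs `unsafe` and `write_unaligned`.
fn set_b(foo: &mut Foo, value: u32) {
    let b = ptr_b(foo);
    // SAFETY: `b` points into the live, exclusively borrowed `foo`.
    unsafe { ptr::write_unaligned(b, value) };
}
```

As in the proposed pattern form, obtaining the pointer from a `&mut Foo` is safe here; the unsafety is confined to the place where the pointer is used.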
This is not only useful for packed fields, but also to access the fields of any object that is only available via pointer because its state invariants may not yet be fulfilled. Instead of manually doing pointer math:
/// Repr Rust, so no layout guarantees, no pointer operations to get to `a`.
struct Weird {
/// Always valid, no matter the memory content.
something_i_dont_care_about: u8,
/// Must only be one of `true` and `false` for a &Weird.
a: bool,
}
/// Unsafety invariants: `w` must
/// * point to some allocation of at least `std::mem::size_of::<Weird>`.
/// * point to memory valid for the chosen lifetime `'a`
/// * be properly aligned.
unsafe fn get_if_init<'a>(w: *const Weird) -> Option<&'a Weird> {
// No need to deref `w`.
let Weird { raw const a, ..} = w;
match core::ptr::read(a as *const u8) {
0 | 1 => Some(std::mem::transmute(w)),
_ => None
}
}
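For comparison, a sketch of `get_if_init` that compiles on current stable Rust. It assumes `core::ptr::addr_of!` (not part of this proposal) in place of the proposed `raw const` binding, and it carries the same unsafety invariants as listed above.

```rust
use core::ptr::addr_of;

struct Weird {
    /// Always valid, no matter the memory content.
    something_i_dont_care_about: u8,
    /// Must only be one of `true` and `false` for a `&Weird`.
    a: bool,
}

/// Safety: `w` must point to a readable, properly aligned allocation of at
/// least `size_of::<Weird>()` bytes that stays valid for `'a`.
unsafe fn get_if_init<'a>(w: *const Weird) -> Option<&'a Weird> {
    unsafe {
        // Address calculation only; no `&bool` is created here.
        let a: *const bool = addr_of!((*w).a);
        // Inspect the byte as `u8`, for which every bit pattern is valid.
        match core::ptr::read(a as *const u8) {
            0 | 1 => Some(&*w),
            _ => None,
        }
    }
}
```

The reference to the whole struct is only produced after the byte backing `a` has been checked.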
The newly introduced patterns are:
- `raw (const|mut) identifier`; allowed for field bindings and identifier bindings. These are allowed in the grammar where `ref? mut? identifier` is allowed currently. For this purpose `raw` is a contextual keyword.
- `* (const|mut) <subpattern>`; to match a pointer not by value but to additionally use structural patterns to get pointers to the fields of its underlying type. Their use requires an `unsafe` block around the expression in which they appear, be it match or irrefutable bindings. However, `<subpattern>` does not allow arbitrary content; this is subject to discussion and future options.
In pointer binding mode, non-top-level standard named bindings default to
`raw _` bindings depending on the pointer type. This should be analogous to
reference binding mode. Top-level bindings do not default to raw bindings and
also need other consideration for backwards compatibility.
const NULL: *mut usize = core::ptr::null_mut();
match (0 as *mut usize) {
// What's currently possible, this is a binding pattern and does no pointer-wrapping.
x => (),
// Also currently possible, top-level value pattern also does no pointer-wrapping.
NULL => (),
// This pointer binding pattern gets a pointer to the pointer.
raw const y => (),
// This explicit pointer-pattern gets a mut pointer to the pointed-to place. We are in
// mutable pointer binding mode and the binding is not top-level. This one is useless.
*mut z => (),
// This explicit pointer-pattern gets a const pointer to the pointed-to place, like cast.
*mut raw const z => (),
}
In cases where the pattern is not a binding or a value pattern, pointer patterns may be elided; the compiler then automatically wraps the inner patterns as required. This mechanism is currently used for references as well, and no collisions are expected, as the wrapper is added according to the requirements of the matched type, which is either reference or pointer.
The calculation of the value from a pointer pattern will not use an `inbounds`
qualifier when passed to LLVM.
There is no restriction on raw-patterns appearing within matching of enum variants and slices, such that this is possible:
#[repr(packed)]
struct Foo {
field: Enum,
}
enum Enum {
A(usize),
}
fn overwrite_packed_field(foo: &mut Foo ) {
// Actually safe!
let Foo { field: Enum::A(raw mut x), } = foo;
// Write itself not safe, as we write to a pointer :/
unsafe { ptr::write_unaligned(x, 0) };
}
Allowed patterns within pointer patterns (and thus in the sugar of pointer
binding mode) are: wildcard pattern, path patterns that don't refer to enum
variants or constants, struct patterns, tuple patterns, and fixed size array
patterns, where the last three are only allowed to bind their fields with the
new pointer pattern and with `..`, potentially also with
`ref mut? identifier`, but not `mut? identifier`.
Some further notes on (dis-)allowed patterns:
- The restrictions don't apply to matching the pointer value itself, as that is not inside a pointer pattern.
- enum variants and constants obviously read their memory.
- literal, identifier, and reference patterns also constitute a read of the pointed-to place, and implicitly assert their type's invariants. Better to keep those operations separate.
- no pointer patterns within pointer patterns, as these must also actually read memory.
- `ref mut? identifier` may be useful, but may be too tempting sometimes. It essentially performs a cast of pointer-to-reference and thus comes with the same caveats: the programmer must ensure liveness and alignment. However, a cast with `transmute` or `as _` is much more explicit.
Pattern structure is enforced after the automatic addition of reference and pointer patterns. This ensures that even implicitly added pointer patterns cannot result in a read through the pattern. The simplest of these accidental mistakes are, however, already prevented by the logic of adding pointer patterns, as the pointer itself could be matched by a value pattern.
fn no_match_on_value(b: *const bool) -> usize {
match b {
// Error: mismatching type, expected value of type `*const bool` ...
true => 0,
// ^^^^ ... for this pattern
_ => 1,
}
}
struct Foo {
bar: usize,
}
fn no_match_on_implicit_value(b: *const Foo) -> usize {
match b {
// Error: can not match by value within pointer pattern.
Foo { bar: 0 } => 0,
// ^^^ .. referring to this value pattern
// Note: pointer pattern occurs in expansion to `*const Foo { bar: 0 }`
_ => 1,
}
}
Pointer patterns and raw bindings are also irrefutable patterns and can thus be
used in `let`-bindings and similar. This was used in the example in the
guide-level explanation:
fn ptr_b(foo: &mut Foo) -> *mut u32 {
// rhs of this let-binding is a place expression to mutable field `b`.
// `raw mut` is irrefutable and gets a pointer to it without reference.
let raw mut b = foo.b;
b
}
Since pointer patterns are guaranteed to not rely on the invariants of the pointed-to memory, they can also be used to match union fields in interesting ways. Maybe this is interesting for custom enum-like encapsulations?
union Mix {
f1: (bool, u8),
f2: (u8, bool),
}
let mut m = Mix { f2: (3, true), };
// f1.0 is not validly initialized, don't grab reference.
let raw const f1_0_ptr = m.f1.0;
// Initialize f1.0 through valid f2.0
m.f2.0 = 0;
// Now we can grab the reference.
let f1_0 = unsafe { &*f1_0_ptr };
Matching unsized values should also simply work; I don't see any complication over matching those by reference, as the pointer already includes the necessary (length) metadata. With regards to network protocols, this would become much, much cooler with unsized unions, but you can't have your cake and eat it, yet.
#[repr(C)]
struct net_pckt {
protocol_type: u8,
content: [u8],
}
let uninitialized_packet_ptr: *const net_pckt = unimplemented!();
unsafe {
// Works nicely even with changes to the packet structure.
let net_pckt { raw mut content, ..} = uninitialized_packet_ptr;
// Get a pointer two bytes into the content. The pointer has the necessary length-metadata.
match content {
[_, _, raw ptr] => /* Packet large enough */ (),
_ => return Err(Error::Truncated),
}
}
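The length check and the offset into `content` can be approximated in current stable Rust without dereferencing the slice pointer. This is only a sketch: the function name `ptr_past_two` is hypothetical, it reads the intent of the match above as "at least three bytes", and it assumes the `len` method on raw slice pointers is available in the toolchain in use.

```rust
/// Returns a pointer two bytes into `content` if the slice described by the
/// fat pointer holds at least three bytes. Only the pointer metadata is
/// inspected; the pointed-to memory is never read.
fn ptr_past_two(content: *const [u8]) -> Option<*const u8> {
    if content.len() >= 3 {
        // `wrapping_add` keeps this safe even for pointers that do not
        // refer to a live allocation.
        Some((content as *const u8).wrapping_add(2))
    } else {
        None
    }
}
```

Using the resulting pointer for a read still requires `unsafe` and a pointer into memory that is valid for that read.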
Match syntax is 'more heavy' than a place based expression syntax in some or many cases. On the other side of the coin, initializing a struct often involves grabbing pointers to all fields, where matching is much terser than each individual expression.
The additional pointer binding mode for match expressions may be confusing due to the non-explicit pointer nature of its argument.
The pointer retrieved from a `raw mut` binding while matching a `&mut _` value
upholds more guarantees than apparent, as it is known to be writable with
`ptr::write_unaligned`. Some yet-to-be-proposed encapsulation could thus make
this completely safe to the programmer. This is a drawback because of the next
argument.
Assigning semantics to the pattern matching of `*` and `raw` has the risk of
being too restricted for future operations but too constrained to allow
backwards compatible extension. Specifically, the type of `id` in a `raw id`
pattern may be hard to change but a pointer upholds almost no invariants on its
own.
`&raw <place>` was also proposed to achieve getting a pointer to a field. The
pattern/match syntax has several advantages over place syntax:
- Place expressions are overloaded with auto-deref and custom indexing (`core::ops::Index`/`core::ops::IndexMut`), invoking arbitrary user code. A solution with place syntax needs to explicitly forbid these forms of place statements, both to disallow user code and to avoid accidental reference intermediates. The new statement thus resembles a very different other statement. Through a `raw` pattern, usable in irrefutable bindings, it is a choice of the programmer to use auto-deref within the rhs place statement, just as familiar, but also to explicitly avoid it when used in the lhs pattern. With the introduced pointer pattern it is also never required to rely on auto-deref within a pointer deref in an `unsafe` block, where one could accidentally invoke a `Deref::deref` implementation on an unintended reference to get a mentioned field (e.g. after a refactoring).
- The initial dereferencing of the pointer necessary for a place expression (`struct.field` is implicitly `(*struct).field` for a reference argument `struct`) will not work with pointer arguments, which do not automatically dereference even in unsafe code (and arguably should not, outside `&raw`).
- `raw` feels more natural when paralleling `ref` instead of appearing as yet another qualifier on `&` that is not associated with pointers in the first place and confusingly also requires `const` in spite of `&` suggesting the opposite.
- It provides a clear pattern that extends to enum fields in packed structs, which are absolutely not expressible in place syntax.
In contrast, patterns fully follow the structural nature of algebraic data
types without customization points in the form of `core::ops`. This makes them
a perfect match when the possibilities should be restricted to exactly those
options.
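The difference between the two forms can be observed in today's Rust with ordinary reference patterns. In this sketch the types `Inner` and `Wrapper` and the counter field `derefs` are hypothetical, introduced only to make the hidden `Deref::deref` call visible.

```rust
use core::cell::Cell;
use core::ops::Deref;

struct Inner {
    field: u32,
}

struct Wrapper {
    inner: Inner,
    /// Counts how often user code in `deref` has run.
    derefs: Cell<u32>,
}

impl Deref for Wrapper {
    type Target = Inner;
    fn deref(&self) -> &Inner {
        self.derefs.set(self.derefs.get() + 1);
        &self.inner
    }
}

/// Place expression: `w.field` is not a field of `Wrapper`, so auto-deref
/// silently calls `Deref::deref`.
fn via_place(w: &Wrapper) -> *const u32 {
    &w.field as *const u32
}

/// Pattern: destructuring follows the struct layout only and never runs
/// user code.
fn via_pattern(w: &Wrapper) -> *const u32 {
    let Wrapper { inner: Inner { field }, .. } = w;
    field as *const u32
}
```

Both functions return the same address, but only `via_place` increments the counter.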
Not doing this would keep surface level code for creating pointers error prone or impossible, independent of the underlying MIR changes.
The C++ state of the art, to my best knowledge, also uses the usual lvalue
expression for a pointer to a field. This has several pitfalls: classes may
overload the pointer dereference operator `->` and the pointer creation
operator `&`. Actually conformant generic code thus requires additional
artificial constructs and a syntax that does not resemble lvalue syntax.
Additionally, most of the operators are not defined while their target object
is not alive, making them unfit for initialization of uninitialized objects.
C (and C++ to an extent) also have `offsetof`, a macro based solution to get
the byte offset of a field. This only works reliably for a very restricted set
of types. This essentially is the analogue of `#[repr(C)]` in Rust. A
`static_assert` based solution can help avoid unwittingly triggering undefined
behaviour on other types.
No other algebraic language with the memory model of Rust is known to the author, thus comparisons in this way are sparse.
The PR #2582 contains the necessary MIR operations to perform the address calculations themselves.
The exact syntax for pointer patterns: while `raw` as a contextual keyword
already has some association with pointer-to-place, it need not be the final
answer. An alternative is, of course, a contextual keyword `ptr` for that
pattern. However, `ptr` will be more ambiguous should a similar syntax be
adopted outside of patterns.
The restrictions on pointer binding mode that are only based on not implicitly
reading memory (enum variants, constants, references, bindings) do not add real
safety, as the matching must occur within an `unsafe` block in any case.
However, they likely do protect against accidental usage similar to auto-deref
in a place expression. They may arguably be more a nuisance than a safety help
nonetheless.
Address calculation will likely depend on not overflowing the pointer, i.e.
behave like `pointer::add`, but could also utilize `pointer::wrapping_add`
instead. That would make the code safer but provide fewer optimization
opportunities. Also, wrapping addition could promote its use to get (specific)
field offsets, within the limits of layout guarantees offered by Rust. Since it
occurs in an unsafe block, the burden of fulfilling necessary preconditions
ultimately rests on the programmer.
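A sketch of what the wrapping variant means in terms of existing library calls. It assumes current stable Rust with `core::mem::offset_of!` and `wrapping_add`; the struct `Pair` and the function name are hypothetical, and `repr(C)` is chosen so that the layout is guaranteed.

```rust
use core::mem::offset_of;

#[repr(C)]
struct Pair {
    first: u8,
    second: u32,
}

/// Computes the address of `second` with wrapping arithmetic. This is safe
/// to call on any pointer, including dangling ones, because no in-bounds
/// requirement is attached to `wrapping_add`.
fn second_ptr_wrapping(p: *const Pair) -> *const u32 {
    (p as *const u8)
        .wrapping_add(offset_of!(Pair, second))
        .cast::<u32>()
}
```

With `add` instead of `wrapping_add` the same calculation would be undefined behaviour for a pointer that is not backed by an allocation, which is the trade-off between safety and optimization described above.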
`ref mut? identifier` within pointer patterns may be disallowed or not. It
amounts to a pointer-to-reference cast of what a `raw identifier` pattern would
bind. For half-initialized structs, where validity and alignment of the
underlying struct have been checked but referencing the complete struct with
`&mut` is not safe due to uninitialized fields, this is also useful.
Alternatively, this could be disallowed if it is not useful enough or if it
seems to promote undefined behaviour.
Some pointer binding matches may be safer than the required `unsafe` suggests:
for example, the pointer retrieved from `MaybeUninit` guarantees that the
memory is actually backed by some allocation, and thus the offset calculations
can both utilize `inbounds` and will never overflow. It could be possible to
remove the need for an `unsafe` block around such matches if they don't use any
of the memory-reading patterns discussed in unresolved questions.
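The `MaybeUninit` case can be sketched in current stable Rust as follows. The struct `Config` and the function `init_config` are hypothetical, and `core::ptr::addr_of_mut!` (not part of this proposal) stands in for the proposed pointer binding.

```rust
use core::mem::MaybeUninit;
use core::ptr::addr_of_mut;

struct Config {
    id: u32,
    enabled: bool,
}

/// Initializes a `Config` field by field without ever creating a reference
/// to uninitialized memory.
fn init_config(id: u32, enabled: bool) -> Config {
    let mut slot = MaybeUninit::<Config>::uninit();
    let p = slot.as_mut_ptr();
    // SAFETY: `p` points into `slot`, which is a live allocation of the right
    // size and alignment, and every field is written before `assume_init`.
    unsafe {
        addr_of_mut!((*p).id).write(id);
        addr_of_mut!((*p).enabled).write(enabled);
        slot.assume_init()
    }
}
```

The field address calculations are always in bounds here because `slot` backs them, which is the property that could justify relaxing the `unsafe` requirement for such matches.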