Skip to content

Latest commit

 

History

History
397 lines (332 loc) · 16.6 KB

0000-pointer-match.md

File metadata and controls

397 lines (332 loc) · 16.6 KB

Summary

Extend match syntax and patterns by support for a limited set of operations for pointers, which involve only address calculation and not actually reading through the pointer value. Make it possible to use these matches to calculate addresses of fields even for repr(packed) structs and possibly unaligned pointers where an intermediate reference must not be created.

Motivation

To create a pointer to a field of a struct, there is currently no way in Rust that avoids creating a temporary reference. Since reference semantics are stricter, this may lead to silent undefined behaviour where that reference should not be valid. Depending on the resolution of reference semantics this affects:

  • Creating a pointer to a field of a packed struct, where the reference may be unaligned (depending on .
  • Pointing to fields of an uninitialized type, where the reference points to uninitialized data. This may be complicated by unions, where it could be possible that not a single variant is currently completely initialized, yet one wants to access some subfield. See rust-lang/unsafe-code-guidelines#73 (comment).
  • Doing pointer offset calculations where the references does not refer to the same, or any, allocation. This is because reference calculations are performed with getelementptr inbounds.

Guide-level explanation

Match expression are extended from support for a reference binding mode, to a pointer binding mode. Furthermore, a new pattern binds to a pointer, and identifiers are extended to allow a new mode similar to ref and ref mut binding to a reference. These patterns are called pointer pattern and raw identifier for the remainder of the document.

#[repr(packed)]
struct Foo {
    a: u16,
    b: u32,
}

fn ptr_b(foo: &mut Foo) -> *mut u32 {
    let raw mut b = foo.b;
    b
}

Note that pointer binding mode and pointer pattern requires unsafe, even when it will never dereference the pointer. But the arithmetic on the pointer may implicitely overflow. Furthermore, not all patterns are (yet) allowed, to avoid implicitely performing an unintended, unsafe read through the pointer. Pointer binding mode will at first only permit ultimately binding with raw and ref and not actually reading the contained memory.

The raw identifier pattern does not require unsafe on its own (as seen above, where we safely match a &mut but bind to *mut).

This is not only useful for packed fields, but also to access the fields of any object that is only available via pointer because its state invariants may not yet be fulfilled. Instead of manually doing pointer math:

/// Repr Rust, so no layout guarantees, no pointer operations to get to `a`.
struct Weird {
    /// Always valid, no matter the memory content.
    something_i_dont_care_about: u8,

    /// Must only be one of `true` and `false` for a &Weird.
    a: bool,
}

/// Unsafety invariants: `w` must 
/// * point to some allocation of at least `std::mem::size_of::<Weird>`.
/// * point to memory valid for the chosen lifetime `'a`
/// * be properly aligned.
unsafe fn get_if_init<'a>(w: *const Weird) -> Option<&'a Weird> {
    // No need to deref `w`.
    let Weird { raw const a, ..} = w;
    match core::ptr::read(a as *const u8) {
    	0 | 1 => Some(std::mem::transmute(w)),
	_ => None
    }
}

Reference-level explanation

The newly introduced patterns are:

  • raw (const|mut) identifier; allowed for field bindings and identifier bindings. These are allowed in the grammar where ref? mut? identifier is allowed currently. For this purpose raw is a contextual keyword.
  • * (const|mut) <subpattern>; to match a pointer not by value but to additionally use structural patterns to get pointers to the fields of its underlying type. Their use requires an unsafe-block around the expression in which they appear, be it match or irrefutable bindings. However, <subpattern> does not allow arbitrary content, this is subject to discussion and future options.

In pointer binding mode, non-top-level standard named bindings default to raw _ bindings depending on the pointer type. This should be analogue to reference binding mode. Top-level bindings do not default to raw bindings and also need other consideration for backwards compabitility.

const NULL: *mut usize = core::ptr::null();
match (0 as *mut usize) {
    // What's currently possible, this is a binding pattern and does no pointer-wrapping.
    x => (),
    // Also currently possible, top-level value pattern also does no pointer-wrapping.
    NULL => (),
    // This pointer binding pattern gets a pointer to the pointer.
    raw const y => (),
    // This explicit pointer-pattern gets a mut pointer to the pointed-to place. We are in
    // mutable pointer binding mode and the binding is not top-level. This one is useless.
    *mut z = (),
    // This explicit pointer-pattern gets a const pointer to the pointed-to place, like cast.
    *mut raw const z => (),
}

In cases where the pattern is not a binding or a value pattern, pointer patterns may be elided and automatically wrap inner patterns by the compiler as required. This mechanism is currently used for references as well and there are no expected collisions as the wrapper is added according to the requirements of the matched type, which is either reference or pointer.

The calculation of the value from a pointer pattern will not use an inbounds qualifier when passed to llvm.

There is no restriction on raw-patterns appearing within matching of enum variants and slices, such that this is possible:

#[repr(packed)]
Foo {
    field: Enum,
}

enum Enum {
    A(usize),
}

fn overwrite_packed_field(foo: &mut Foo ) {
    // Actually safe!
    let Foo { field: Enum::A(raw mut x), } = foo;

    // Write itself not safe, as we write to a pointer :/
    unsafe { ptr::write_unaligned(x, 0) };
}

Allowed patterns within pointer patterns (and thus in the sugar of pointer binding mode) are: wildcard patttern, path patterns that don't refer to enum variants or constants, struct patterns, tuple patterns, fixed size array patterns, where the last three are only not allowed to bind their fields with the new pointer pattern and with .., potentially also with ref mut? identifier, but not mut? identifier. Some further notes on (dis-)allowed patterns:

  • The restrictions don't apply to matching the pointer value itself, as that is not inside a pointer pattern.
  • enum variants and constants obviously read their memory.
  • literal, identifier, and reference patterns also constitute a read of the pointed-to place, and implicitely assert their type's invariants. Better to keep those operations separate.
  • no pointer patterns within pointer patterns, must also actually read memory.
  • ref mut? identifier may be useful, but may be too tempting sometimes. It essentially performs a cast of pointer-to-reference and thus comes with the same caveats: The programmer must ensure liveness and alignment. However, cast with transmute or as _ is much more explicit.

Pattern structure is enforced after the automatic addition of reference and pointer patterns. This ensures that even implicitely added pointer patterns can not result in a read through the pattern. The simplest of these accidental mistakes are however already prevent by the logic of adding pointer patterns as the pointer itself could be matched by a value pattern.

fn no_match_on_value(b: *const bool) -> usize {
    match b {
        // Error: mismatching type, expected value of type `*const bool` ...
        true => 0,
     // ^^^^ ... for this pattern
	_ => 1,
    }
}

struct Foo {
    bar: usize,
}

fn no_match_on_implicit_value(b: *const Foo) -> usize {
    match b {
	// Error: can not match by value within pointer pattern.
        Foo { bar: 0 } => 0,
        //        ^^^ .. referring to this value pattern
	// Note: pointer pattern occurs in expansion to `*const Foo { bar: 0 }`
	_ => 1,
    }
}

Pointer patterns and raw bindings are also irrefutable patterns and can thus be used in let-bindings and similar. This was used in the example in the guide-level explanation:

fn ptr_b(foo: &mut Foo) -> *mut u32 {
    // rhs of this let-binding is a place expression to mutable field `b`.
    // `raw mut` is irrefutable and gets a pointer to it without reference.
    let raw mut b = foo.b;
    b
}

Since pointer patterns are guaranteed to not rely on the pointed-to memory invariants, it can also be used to match union fields in interesting ways. Maybe this is interesting for custom enum-like-encapsulations?

union Mix {
    f1: (bool, u8),
    f2: (u8, bool),
}

let mut m = Mix { f2: (3, true), };
// f1.0 is not validly initialized, don't grab reference.
let raw const f1_0_ptr = m.f1.0;
// Initialize f1.0 through valid f2.0
m.f2.0 = 0;
// Now we can grab the reference.
let f1_0 = unsafe { &*f1_0_ptr };

Match unsized values should also simply work, I don't see any complication over matching those by reference as the pointer already includes the necessary (length)-metadata. With regards to network protocols, this would become much, much cooler with unsized unions but you can't have your cake and eat it, yet.

#[repr(C)]
struct net_pckt {
    protocol_type: u8,
    content: [u8],
};

let uninitialized_packet_ptr: *const net_pckt = unimplemented!();
unsafe {
    // Works nicely even with changes to the packet structure.
    let net_pckt { raw mut content, ..} = uninitialized_packet_ptr;
    // Get a pointer two bytes into the content. The pointer has the necessary length-metadata.
    match content {
        [_, _, raw ptr] => /* Packet large enough */ (),
	_ => return Err(Error::Truncated),
    } 
}

Drawbacks

Match syntax is 'more heavy' than a place based expression syntax in some or many cases. On the other side of the coin, initializing a struct often involves grabbing pointers to all fields, where matching is much terser than each indivdual expression.

The additional pointer binding mode for match expressions may be confusing due to the non-explicit pointer nature of its argument.

The pointer retrieved from raw mut binding while matching a &mut _ value upholds more guarantees than aparent, as it is known to be writable with ptr::write_unaligned. Some yet-to-be-proposed encapsulation could thus make this completely safe to the programmer. This is a drawback because of the next argument.

Assigning semantics to the pattern matching of * and raw has the risk of being too restricted for future operations but too constrained to allow backwards compatible extension. Specifically, the type of id in a raw id pattern may be hard to change but a pointer upholds almost no invariants on its own.

Rationale and alternatives

&raw <place> was also proposed to achieve getting a pointer to a field. The pattern/match syntax has several advantages over place syntax:

  • Place expressions are overloaded with auto-deref, custom indexing (core::ops::Index/core::ops::IndexMut), invoking arbitrary user code. A solution with place syntax needs to explicitely forbid these forms of place statements, both to disallow user code and avoid accidental reference intermediates. The new statements thus resembles a very different other statement.

    Through a raw pattern, usable in irrefutable bindings, it is a choice of the programmer to use auto-deref within the rhs place statement, just as familar, but also to explicitely avoid it when used in the lhs-pattern. By the introduced pointer pattern it is also never required to rely on auto-deref within a pointer deref in an unsafe block, where one could accidentally invoke a Deref::deref implementation on an unintended reference to get a mentioned field (e.g. after a refactoring).

  • The initial dereferencing of the pointer necessary for a place expression (struct.field is implicitely (*struct).field for a reference argument struct) will not work with pointer arguments, which do no automatically dereference even in unsafe code (and arguably should not, outside &raw).

  • raw feels more natural when paralleling ref instead of appearing as yet an additional qualifier on & that is not associated with pointers in the first place and confusingly also requires const in spite of & suggesting the opposite.

  • It provides a clear pattern that extends to enum fields in packed structs, which are not absolutely not expressible in place syntax.

In contrast, patterns fully follow the structural nature of algebraic data types without customization points in the form of core::ops. This makes them a perfect match when the possibilities should be restricted to exactly those options.

Not doing this would keep surface level code for creating pointers error prone or impossible, independent of the underlying MIR changes.

Prior art

C++ state-of-the-art, to my best knowledge, also uses the usual lvalue expression for a pointer to a field. This has several pitfalls: Classes may overwrite the pointer dereference operator ->, and the pointer creation operator &. Actually conformant generic code thus requires additional artificial constructs and a syntax that does not resemble lvalue syntax. Additionally, most of the operator are not defined while their target object is not life, making them unfit for initialization of uninitialized objects.

C (and C++ to an extent) also have offsetof, a macro based solution to get the byte offset of a field. This only works reliably for a very restricted set of types. This essentially is the analogue of #[repr(C)] in Rust. A static_assert based solution can help unwittingly triggering undefined behaviour on other types.

No other algebraic language with the memory model of Rust is known to the author, thus comparisons in this way are sparse.

The PR #2582 contains the necessary MIR operations to perform the address calculations themselves.

Unresolved questions

The exact syntax for pointer patterns, while raw as a contextual keyword has already some association with pointer to place it need not be the final answer. An alternative is, of course, a contextual keyword ptr for that pattern. However, ptr will be more ambiguous should a similar syntax be adopted outside of patterns.

The restrictions on pointer binding mode that are only based on not implicitely reading memory (enum variants, constants, references, bindings) do not add real safety, as the matching must occur within an unsafe block in any case. However, they likely do protect against accidental usage similar to auto-deref in a place expression. They may arguable be more a nuisance than a safety help nonetheless.

Address calculation will likely depend on not overflow the pointer, i.e. behave like pointer::add but could also utilize pointer::wrapping_add instead. That would make the code safer but provide fewer optimization opportunities. Also, wrapping addition could promote use to get (specific) field offsets, within the limits of layout guarantees offered by rust. Since it occurs in an unsafe block, the burden of fulfilling necessary preconditions ultimately relies on the programmer.

ref mut? identifier within pointer patterns may be disallowed or not. raw identifier pattern. For half-initialized structs where validity and alignment of the underlying struct has been checked but &mut referencing the complete struct is not safe due to uninitialized fields this is also useful. Alternatively, this could be disallowed if not useful enough or it seems to promote undefined behaviour.

Future possibilities

Some pointer binding matches may be safer than the required unsafe suggests: For example the pointer retrieved from MaybeUninit guarantees that the memory is actually backed by some allocation and thus the offset calculations can both utilize inbounds and will never overflow. It could be possible to remove the need for an unsafe block around such matches if they don't use any of the memory-reading-patterns discussed in unresolved questions.