
What about: volatile, concurrency, and interaction with untrusted threads #152

Closed

Description

@hsivonen
Member

Designate outside-of-program changes to memory accessed by volatile as non-UB

Context: An internals thread.

Use case: Rust is used to write privileged code (host services provided by a runtime environment to a JITed language, an OS kernel providing syscall services to userland code, or a hypervisor providing emulated devices to a guest system) that needs to access memory that unprivileged code can also access. The unprivileged code can have multiple threads, so while unprivileged thread A has requested a service from the host (the host service running logically on A's thread of execution), a separate unprivileged thread of execution B could, if it is behaving badly, concurrently access the same memory from another CPU core. The unprivileged thread of execution must not be allowed to cause the privileged code written in Rust to experience UB. (It's fine for the unprivileged code to cause itself to experience UB within the bounds of its sandbox.)

The memory model itself is a whole-program model, so it doesn't apply, since in order to provide the guarantees it pledges to our thread, we must pledge the absence of data races from other threads of execution, which we can't do in this case. Hence, we need a way to access memory that is outside the memory model in the sense that there could exist an adversarial additional thread of execution that doesn't adhere to the DRF requirement. We're not trying to communicate with that thread of execution. The issue is just not letting it cause security bugs on us.

The C++ paper P1152R0 "Deprecating volatile" gives this use case as the very first item on its list of legitimate uses of volatile in C and C++. This makes sense: if volatile works when external changes are caused by memory-mapped IO (the use case documented for std::ptr::read_volatile and the original use case motivating the existence of volatile in C), then given the codegen for volatile and the codegen for relaxed atomics on architectures presently supported by Rust, it makes sense for it to also work when external changes are caused by a rogue thread of execution.

Yet, the documentation says: "a race between a read_volatile and any write operation to the same location is undefined behavior". I believe it's unnecessary and harmful to designate this as UB, and it would be sufficient to merely say that the values returned by read_volatile are unpredictable in that case. This makes sense in light of an IO-like view of volatile: you need to be prepared to receive any byte from an IO stream, so not knowing at compile time what you are going to get does not have to be program-destroying UB if you are prepared to receive a value not predicted at compile time.

I suggest that a) the documentation be changed not to designate concurrent external modification of memory locations that a Rust program only accesses as volatile to be UB and b) to state in the Unsafe Code Guidelines that it's legitimate to use volatile accesses to access memory that a thread of execution external to the Rust program might change concurrently. That is, while you may read garbage, the optimizer won't assume that two volatile reads from the same location yield the same value and won't invent reads from memory locations written to using volatile writes (i.e. the memory locations are considered shared and, therefore, ineligible to be used as spill space by the compiler).
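The intended usage pattern can be sketched as follows. This is a minimal, hedged illustration, not an API from the issue: an ordinary local variable stands in for the shared mapping, and `read_untrusted_len` is a hypothetical helper name. The point is that the value read is treated as untrusted input that is validated before use, and never re-read under the assumption that it is unchanged.

```rust
use std::ptr;

// Hypothetical sketch: `shared` stands in for a memory mapping that an
// untrusted thread of execution may write to concurrently. Under the
// proposed documentation change, observing an unexpected value here is
// not UB; the value is merely unpredictable.
unsafe fn read_untrusted_len(shared: *const u32, limit: u32) -> u32 {
    // A volatile read: the compiler will not assume two reads of this
    // location yield the same value, and will not invent extra reads.
    let raw = ptr::read_volatile(shared);
    // Validate before use; never re-read and assume the same value,
    // since a second read could return something different.
    if raw <= limit { raw } else { 0 }
}

fn main() {
    let backing: u32 = 42; // stands in for the shared mapping
    let v = unsafe { read_untrusted_len(&backing, 100) };
    assert_eq!(v, 42);
    println!("{v}");
}
```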

Replies to the thread linked to above indicate that this should already be the case despite the documentation suggesting otherwise.

Also see #152 (comment) which tries to summarize the discussion-until-then a bit.

Activity

gnzlbg

gnzlbg commented on Jun 25, 2019

@gnzlbg
Contributor

I don't think we need to distinguish between internal and external threads of execution. A thread of execution is just that, and the concurrent semantics of volatile should be specified for those in general.

Volatile loads and stores are not guaranteed to be atomic (or to synchronize in any particular way); that is, using them to concurrently read/write some memory could introduce a data race.

Suppose you have an [u16] in shared memory, and two threads of execution modify it by writing either 0x00 or 0xff to it. If you use properly synchronized memory accesses, you are guaranteed to always read either 0x00 or 0xff from the array in any of the processes. If you use volatile loads / stores, one thread of execution could read 0xf0 or 0x0f. This can easily result in mis-optimizations even if the two threads of execution are run in different processes, e.g. if the program uses a type that properly expresses these semantics (e.g. a NonZero-like type), hint::assume annotations, etc.
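The tearing hazard above can be simulated single-threadedly (a sketch, not a real race; `simulate_torn_read` is a made-up name for illustration): if a 2-byte store is performed as two 1-byte stores, a reader in between observes a value that no writer ever stored.

```rust
use std::ptr;

// Single-threaded simulation of tearing: a writer storing 0xffff one
// byte at a time is "interrupted" after the first byte, and a reader
// then observes a value that is neither 0x0000 nor 0xffff.
fn simulate_torn_read() -> u16 {
    let mut word: u16 = 0x0000;
    let bytes = &mut word as *mut u16 as *mut u8;
    unsafe {
        // First half of a non-atomic 2-byte store...
        ptr::write_volatile(bytes, 0xff);
        // ...reader runs here and sees a torn value
        // (0x00ff or 0xff00 depending on endianness).
        ptr::read_volatile(bytes as *const u16)
    }
}

fn main() {
    let torn = simulate_torn_read();
    assert!(torn == 0x00ff || torn == 0xff00);
    println!("torn value: {torn:#06x}");
}
```

A type like NonZeroU16 whose invariant excludes 0x0000 would still be violated if the writers' values had been chosen differently, which is why such torn reads can feed mis-optimizations.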

Now, when people use volatile loads / stores to access IO registers, they exploit the platform-specific knowledge that the volatile load / store is atomic for that particular register or memory access size on that particular architecture.

For example, when reading a temperature from a sensor by using a volatile load on a 32-bit register, the hardware often guarantees atomicity for memory loads, making it impossible to observe a partially modified result. Here you can get away with a volatile load because the toolchain and the hardware conspire together to avoid a data-race.
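A hedged sketch of that MMIO pattern (hypothetical names; an ordinary variable stands in for the device register so the example can run on a hosted target): on real hardware the pointer would be a fixed device address, and the 32-bit volatile load would be free of tearing only because the architecture guarantees atomicity for aligned 32-bit accesses.

```rust
use std::ptr;

// Sketch: read a hypothetical 32-bit temperature register. The volatile
// load may not be cached, duplicated, or removed by the compiler; the
// hardware supplies the atomicity guarantee.
fn read_temperature(reg: *const u32) -> u32 {
    unsafe { ptr::read_volatile(reg) }
}

fn main() {
    // Stand-in for a memory-mapped register, e.g. 23.05 °C in centidegrees.
    let fake_register: u32 = 2_305;
    let t = read_temperature(&fake_register);
    assert_eq!(t, 2_305);
    println!("{t}");
}
```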

We could document that whether volatile loads and stores are atomic and synchronize in any particular way for particular memory addresses, memory access sizes, targets, target features, target CPUs, etc. is unspecified. That is, implementations are not required to document this behavior, and the volatile loads and stores are not guaranteed to be atomic. If misuse introduces a data race, the behavior is undefined.

AFAICT, what we cannot do is guarantee that what you are trying to accomplish always works (for all Ts, on all targets, etc.).

HadrienG2

HadrienG2 commented on Jun 25, 2019

@HadrienG2

I think one additional thing which @hsivonen needs is for data races to produce sane effects ("garbage data" and that's it) when the race occurs on a "bag of bits" data type which has neither padding nor invalid bit patterns, such as u8, u16...

It's a topic that comes up frequently in unsafe code discussions though, so perhaps there's already a UCG topic on this kind of matter that we can study for prior art.

HadrienG2

HadrienG2 commented on Jun 25, 2019

@HadrienG2

A related issue, which is mostly of theoretical interest in multi-process scenarios as current implementations don't really have another option than to do the right thing, is that an adversarial thread writing invalid data into the shared memory space (e.g. mem::uninitialized::<u8>()) should not trigger UB in a privileged thread reading from it with proper precautions, given again a restriction to data types for which all bit patterns are valid.

comex

comex commented on Jun 25, 2019

@comex

I think volatile should be documented and specified along these general lines:

  • Unlike almost everything else in the language, volatile defers to hardware semantics. volatile_load means "emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

  • Accordingly, any properties involving atomicity, memory ordering, or handling of uninitialized data are guaranteed iff the underlying architecture guarantees them. Rust does not require its supported architectures to make any such guarantees; an architecture could, e.g., make racing volatile accesses undefined behavior, although no current Rust architectures do so. On the other hand, when compiling for an architecture that does provide guarantees, the compiler will not perform optimizations that break code that relies on those guarantees.

[What about miri?]

comex

comex commented on Jun 25, 2019

@comex

(Sorry for double post -)

But I also think we should have a better answer for shared memory than volatile.

In particular, as discussed in the internals thread, we may want to guarantee that the UB caused by races between atomic and non-atomic accesses, if the accesses are in different processes, only affects the process performing the non-atomic access. In other words, you can safely use atomic accesses on shared memory even if your communication partner might be malicious – at least with regards to that particular source of UB. That seems like a reasonable guarantee.
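A sketch of what relying on that guarantee would look like (hypothetical helper name; an ordinary AtomicU32 stands in for the shared mapping): the trusted side only ever uses atomic accesses, so any UB from a racing non-atomic access on the other side would, under the proposed guarantee, be confined to the malicious process.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch: the trusted process reads a flag in shared memory using a
// relaxed atomic load. A partner performing racing non-atomic accesses
// can make this value unpredictable, but (under the proposed guarantee)
// cannot cause UB in *this* process.
fn read_flag(shared: &AtomicU32) -> u32 {
    shared.load(Ordering::Relaxed)
}

fn main() {
    let shared = AtomicU32::new(0);
    shared.store(7, Ordering::Relaxed); // stands in for the other side writing
    assert_eq!(read_flag(&shared), 7);
    println!("ok");
}
```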

On the other hand, there are other, more plausible ways that an architecture could hypothetically break this sort of IPC. For example, it could give each byte of memory an "initializedness" status, such that if process A writes uninitialized data to memory and process B reads it, process B gets an uninitialized value and traps if it tries to use it. (Note that Itanium does not do this; it tracks initializedness for registers, but not for memory.)

HadrienG2

HadrienG2 commented on Jun 25, 2019

@HadrienG2

In an ideal world, there would be a way for the untrusting thread to state "I know that I might be reading uninitialized or otherwise fishy memory, please let me do so and return arbitrary bytes on incorrect usage".

Kind of like the ptr::freeze that was proposed there, but with transactional semantics to account for the racy nature of the situation, i.e. fn(*mut T) -> T.

However, I have no idea how to make that actually interact correctly with emulations of hardware that can track use of uninitialized memory but provide no way for software to announce voluntary access to uninitialized memory, such as valgrind.

gnzlbg

gnzlbg commented on Jun 25, 2019

@gnzlbg
Contributor

"emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

I find that using "emit exactly one load instruction" to denote potentially many actual hardware load instructions confusing.

comex

comex commented on Jun 25, 2019

@comex

Feel free to improve the wording :)

I mean of course that there is one load instruction per volatile_load call.

Though even that is slightly imprecise. The compiler can still inline or otherwise duplicate code, in which case a given binary could contain multiple locations corresponding to a single call to volatile_load in the source code, each of which would have its own load instruction.

Perhaps it's better to say: Any given execution of volatile_load will be performed by executing a single load instruction.

gnzlbg

gnzlbg commented on Jun 25, 2019

@gnzlbg
Contributor

@comex

I mean of course that there is one load instruction per volatile_load call.

This:

	#[no_mangle]
	pub unsafe fn foo(x: *mut [u8; 4096]) -> [u8; 4096] {
	    x.read_volatile()
	}

generates ~16000 instructions on godbolt. I don't know of any architecture in which this could actually be lowered to exactly one load instruction.

petrochenkov

petrochenkov commented on Jun 25, 2019

@petrochenkov

[u8; 4096]

I assume @comex would expect this to be written as a loop.
If this model is followed, it would actually be nice to report a warning/error from the backend if the read_volatile/write_volatile cannot be lowered into a single load/store with the given size and alignment.

gnzlbg

gnzlbg commented on Jun 25, 2019

@gnzlbg
Contributor

@petrochenkov How would this be implemented?

No matter how I look at this, I see a lot of issues.

If we guarantee that (*mut T).read_volatile() either lowers to a single instruction or compilation fails, unsafe code will rely on that for safety, so this would need to be a guaranteed compilation error.

One issue is that we can only emit this error at monomorphization time. I'm not sure how we could fix that.

Another issue is that this would be a breaking change, but I suppose that we could either add new APIs and deprecate the old ones, or somehow emit this error only in newer editions (these operations are almost intrinsics).

I wonder what the implementation would look like. AFAIK only the LLVM target backend can know how many instructions the load lowers to, and requiring this kind of cooperation from all LLVM backends (and Cranelift) seems unrealistic. I suppose we could generate "tests" during compilation, in which we count the instructions, but that seems brittle.

Then I wonder how this could work on Wasm with SharedArrayBuffer. Even if we lower a volatile load to a single Wasm instruction, the machine code generator might lower that into multiple loads depending on the host, and if that's the case, there is no way for it to report an error.

comex

comex commented on Jun 25, 2019

@comex

@gnzlbg

Good point – volatile accesses of types that don't correspond to machine register types are somewhat ill-defined, AFAIK. But, e.g.

	#[no_mangle]
	pub unsafe fn foo(x: *mut u64) -> u64 {
	    x.read_volatile()
	}

should definitely be guaranteed to be a single load instruction on x86-64; same goes for smaller integer types.

Of course there's limited room to make decisions here since we're largely at the mercy of LLVM, but I'd say the rule is roughly "if the 'obvious' way to translate this load is with a single instruction, it has to be a single instruction".

Personally, I'd prefer if volatile produced a compile error in other cases, but that ship has sailed.

The rule gets less clear when SIMD gets involved. x86-64 can perform 128-bit loads via SIMD, and currently, a volatile_load of packed_simd::f32x4 does indeed expand to such an instruction:

	movaps	(%rdi), %xmm0

On the other hand, a volatile_load of u128 expands to two 64-bit regular loads:

	movq	(%rdi), %rax
	movq	8(%rdi), %rdx

I'd say this is defensible, because even if the architecture has some way to perform a 128-bit load, that's not the same as there being an 'obvious' way to load a 128-bit integer. On the other hand, with f32x4 we are explicitly requesting SIMD, so it's arguably reasonable to guarantee that it uses an appropriate SIMD instruction.

But in any case, it doesn't really matter whether a SIMD load is guaranteed, since movaps is not guaranteed to be atomic at an architectural level, so the difference between the two is not really observable.

comex

comex commented on Jun 25, 2019

@comex

Then I wonder how this could work on Wasm SharedArrayBuffer. Even if we lower a volatile load to a single WASM instruction, the machine code generator might lower that into multiple loads depending on the host, and if that's the case, there is no way for it to report an error.

I'd say that the 'architecture' here is Wasm, not whatever it's ultimately compiled into. Just as x86-64 has 128-bit load instructions that aren't guaranteed to be atomic, Wasm apparently doesn't guarantee that loads of any size are atomic, unless you use the special atomic instructions. But that's fine; it's already established that the set of guarantees provided depends on the architecture.

However, Wasm is arguably a motivating use case for exposing additional intrinsics for loads and stores marked both atomic and volatile (which LLVM supports).

petrochenkov

petrochenkov commented on Jun 25, 2019

@petrochenkov

AFAIK only the LLVM target backend can know to how many instructions the load lowers to, and requiring this kind of cooperation from all LLVM backends (and Cranelift) seems unrealistic.

That's exactly what I meant by "from the backend" - a target-specific LLVM backend, before that point no one knows about instruction specifics.
I don't know whether LLVM has necessary infrastructure or not, I'm just saying that having it would be useful.
(This certainly shouldn't be any kind of language-level guarantee, only a lint-like diagnostic, even if it's highly reliable.)

comex

comex commented on Jun 26, 2019

@comex

I'm a little confused what the argument is about, but to be clear – for MMIO to work correctly, volatile loads and stores of basic integer types (specifically, ones that can fit into a register) must be done using a single load/store instruction of the correct width. So the behavior with those types needs to be a language-level guarantee. That should probably also apply to #[repr(transparent)] wrappers around those types.

The behavior of volatile with other types, as I said, is relatively ill-defined. In most cases it does seem fairly likely to be a mistake – e.g. if the [u8; 4096] example were in real code, it would be unlikely that the author actually meant to generate that giant function that copies the bytes one-by-one. Or if, say, someone used a struct to represent a block of MMIO registers, they might accidentally write my_struct.volatile_load().field instead of (&raw my_struct.field).volatile_load(). So it might make sense to produce some sort of diagnostic. On the other hand, I could imagine volatile accesses with large types being done intentionally for the shared-memory use case.
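A `#[repr(transparent)]` wrapper of the kind mentioned above could look like this sketch (hypothetical type and helper names; an ordinary variable stands in for the device register): since the wrapper has exactly the layout of u32, a volatile access of the wrapper should lower the same way as a volatile access of u32 itself.

```rust
use std::ptr;

// Sketch: a transparent newtype over u32, so volatile accesses of `Reg`
// are layout-identical to volatile accesses of u32 and should get the
// same single-instruction guarantee.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Reg(u32);

unsafe fn read_reg(p: *const Reg) -> Reg {
    ptr::read_volatile(p)
}

fn main() {
    let backing = Reg(0xdead_beef); // stands in for a device register
    let r = unsafe { read_reg(&backing) };
    assert_eq!(r, Reg(0xdead_beef));
    println!("ok");
}
```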

[134 remaining items not shown]
