Tag Archives: rcu

Forcing memory barriers on other CPUs with mprotect(2)

I have something of an unfortunate fondness for indefensible hacks.

Like I discussed in my last post, RCU is a synchronization mechanism that excels at protecting read mostly data. It is a particularly useful technique in operating system kernels because full control of the scheduler permits many fairly simple and very efficient implementations of RCU.

In userspace, the situation is trickier, but still manageable. Mathieu Desnoyers and Paul E. McKenney have built a Userspace RCU library that contains a number of different implementations of userspace RCU. For reasons I won’t get into, efficient read side performance in userspace seems to depend on having a way for a writer to force all of the reader threads to issue a memory barrier. The URCU library has one version that does this using standard primitives: it sends signals to all other threads; in their signal handlers the other threads issue barriers and indicate so; the caller waits until every thread has done so. This is very heavyweight and inefficient because it requires running all of the threads in the process, even those that aren’t currently executing! Any thread that isn’t scheduled now has no reason to execute a barrier: it will execute one as part of getting rescheduled. Mathieu Desnoyers attempted to address this by adding a membarrier() system call to Linux that would force barriers in all other running threads in the process; after more than a dozen posted patches to LKML and a lot of back and forth, it got silently dropped.

While pondering this dilemma I thought of another way to force other threads to issue a barrier: by modifying the page table in a way that would force an invalidation of the Translation Lookaside Buffer (TLB) that caches page table entries! This can be done pretty easily with mprotect or munmap.

Full details in the patch commit message.

Why We Fight

Why We Fight, or

Why Your Language Needs A (Good) Memory Model, or

The Tragedy Of memory_order_consume’s Unimplementability

This, one of the most terrifying technical documents I’ve ever read, is why we fight: https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt.

Background

For background, RCU is a mechanism used heavily in the Linux kernel for locking around read-mostly data structures; that is, data structures that are read frequently but fairly infrequently modified. It is a scheme that allows for blazingly fast read-side critical sections (no atomic operations, no memory barriers, not even any writing to cache lines that other CPUs may write to) at the expense of write-side critical sections being quite expensive.

The catch is that writers might be modifying the data structure as readers access it: writers are allowed to modify the data structure (often a linked list) as long as they do not free any memory removed until it is “safe”. Since writers can be modifying data structures as readers are reading from it, without any synchronization between them, we are now in danger of running afoul of memory reordering. In particular, if a writer initializes some structure (say, a routing table entry) and adds it to an RCU protected linked list, it is important that any reader that sees that the entry has been added to the list also sees the writes that initialized the entry! While this will always be the case on the well-behaved x86 processor, architectures like ARM and POWER don’t provide this guarantee.

The simple solution to make the memory order work out is to add barriers on both sides on platforms where it is need: after initializing the object but before adding it to the list and after reading a pointer from the list but before accessing its members (including the next pointer). This cost is totally acceptable on the write-side, but is probably more than we are willing to pay on the read-side. Fortunately, we have an out: essentially all architectures (except for the notoriously poorly behaved Alpha) will not reorder instructions that have a data dependency between them. This means that we can get away with only issuing a barrier on the write-side and taking advantage of the data dependency on the read-side (between loading a pointer to an entry and reading fields out of that entry). In Linux this is implemented with macros “rcu_assign_pointer” (that issues a barrier if necessary, and then writes the pointer) on the write-side and “rcu_dereference” (that reads the value and then issues a barrier on Alpha) on the read-side.

There is a catch, though: the compiler. There is no guarantee that something that looks like a data dependency in your C source code will be compiled as a data dependency. The most obvious way to me that this could happen is by optimizing “r[i ^ i]” or the like into “r[0]”, but there are many other ways, some quite subtle. This document, linked above, is the Linux kernel team’s effort to list all of the ways a compiler might screw you when you are using rcu_dereference, so that you can avoid them.

This is no way to run a railway.

Language Memory Models

Programming by attempting to quantify over all possible optimizations a compiler might perform and avoiding them is a dangerous way to live. It’s easy to mess up, hard to educate people about, and fragile: compiler writers are feverishly working to invent new optimizations that will violate the blithe assumptions of kernel writers! The solution to this sort of problem is that the language needs to provide the set of concurrency primitives that are used as building blocks (so that the compiler can constrain its code transformations as needed) and a memory model describing how they work and how they interact with regular memory accesses (so that programmers can reason about their code). Hans Boehm makes this argument in the well-known paper Threads Cannot be Implemented as a Library.

One of the big new features of C++11 and C11 is a memory model which attempts to make precise what values can be read by threads in concurrent programs and to provide useful tools to programmers at various levels of abstraction and simplicity. It is complicated, and has a lot of moving parts, but overall it is definitely a step forward.

One place it falls short, however, is in its handling of “rcu_dereference” style code, as described above. One of the possible memory orders in C11 is “memory_order_consume”, which establishes an ordering relationship with all operations after it that are data dependent on it. There are two problems here: first, these operations deeply complicate the semantics; the C11 memory model relies heavily on a relation called “happens before” to determine what writes are visible to reads; with consume, this relation is no longer transitive. Yuck! Second, it seems to be nearly unimplementable; tracking down all the dependencies and maintaining them is difficult, and no compiler yet does it; clang and gcc both just emit barriers. So now we have a nasty semantics for our memory model and we’re still stuck trying to reason about all possible optimizations. (There is work being done to try to repair this situation; we will see how it turns out.)

Shameless Plug

My advisor, Karl Crary, and I are working on designing an alternate memory model (called RMC) for C and C++ based on explicitly specifying the execution and visibility constraints that you depend on. We have a paper on it and I gave a talk about it at POPL this year. The paper is mostly about the theory, but the talk tried to be more practical, and I’ll be posting more about RMC shortly. RMC is quite flexible. All of the C++11 model apart from consume can be implemented in terms of RMC (although that’s probably not the best way to use it) and consume style operations are done in a more explicit and more implementable (and implemented!) way.