Author Archives: sully

MicroKanren (μKanren) in Haskell

Our PL reading group read the paper “μKanren: A Minimal Functional Core for Relational Programming” this week. It presents a minimalist logic programming language in Scheme in 39 lines of code. Since none of us are really Schemers, a bunch of us quickly set about porting the code to our personal pet languages. Chris Martens produced this SML version. I hacked up a version in Haskell.

The most interesting part of this was the mistake I made in the initial version. To deal with recursion and potentially infinite search trees, the Scheme version allows some laziness: streams of results can be functions that delay search until forced, and when a Scheme μKanren program wants to create a recursive relation it needs to wrap the recursive call in a dummy function (and plumb through the input state); the Scheme version wraps this in a macro called Zzz to make doing it more palatable. I originally thought that all of this could be dispensed with in Haskell: since Haskell is lazy, no special work needs to be done to prevent self-reference from causing an infinite loop. The delay mechanism served an important secondary purpose, though: it provides a way to detect recursion so that we can switch which branch of the tree we are exploring. Without it, although the fives test below works, the fivesRev test loops forever without producing anything.

The initial version was also more general: the type signatures allowed operating over any MonadPlus, which would have allowed pluggable search strategies, and KList was just a newtype wrapper around lists. When I had to add delay I could have defined a new MonadPlusDelay typeclass and parametrized over that, but it didn’t seem worthwhile.

A mildly golfed version that drops blank lines, type annotations, comments, aliases, and test code clocks in at 33 lines.

Parallelizing compiles without parallelizing linking – using make

I have to build LLVM and Clang a lot for my research. Clang/LLVM is quite large and takes a long time to build if I don’t use -j8 or so to parallelize the build, but I quickly discovered that simply parallelizing the build didn’t work either! I work on a laptop with 8 GB of RAM, and while that can easily handle 8 parallel compiles, 8 parallel links plus Firefox and Emacs and everything else is a one-way ticket to swap town.

So I set about finding a way to parallelize the compiles but not the links. Here I am focusing on building an existing project; there are probably nicer ways for the author of a Makefile to make this easier for users (or the default), but I haven’t really thought about that.

My first attempt was the hacky (while ! pgrep ld.bfd.real; do sleep 1; done; killall make ld.bfd.real) & make -j8; sleep 2; make. Here we wait until a linker process appears, kill make and the linker, and then rerun make without parallelism. I expanded this into a more general script:

This approach is kind of terrible. It’s really hacky, it has a concurrency bug (which I would fix if the whole thing weren’t already so bad), and it slows things down far more than necessary: as soon as one link has started, nothing else is done in parallel.

A better approach is to use locking to make sure only one link command can run at a time. There is a handy command, flock, that does just that: it uses a file lock to serialize execution of a command. We can just replace the Makefile’s linker command with a command that calls flock and everything will sort itself out. Unfortunately there is no totally standard way for Makefiles to represent how they do linking, so some Makefile source diving becomes necessary. (Many use $(LD); LLVM does not.) With LLVM, the following works: make -j8 'Link=flock /tmp/llvm-build $(Compile.Wrapper) $(CXX) $(CXXFLAGS) $(LD.Flags) $(LDFLAGS) $(TargetCommonOpts) $(Strip)'

That’s kind of nasty, and we can do a bit better. Many projects use $(CC) and/or $(CXX) as their underlying linking command; if we override that with something that uses flock then we’ll wind up serializing compiles as well as links. My hacky solution was to write a wrapper script that scans its arguments for “-c”; if it finds a “-c” it assumes it is a compile, otherwise it assumes it is a link and uses locking. We can then build LLVM with: make -j8 'CXX=lock-linking /tmp/llvm-build-lock clang++'.
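
My actual wrapper was a few lines of shell, but the idea fits in a tiny C program too. Here is a sketch, purely for illustration (the names and argument conventions are mine, error handling is omitted): the first argument is the lock file, the second is the real compiler, and everything else is passed through.

/* Sketch of the lock-linking idea: if "-c" appears in the arguments, assume a
 * compile and just exec the real compiler; otherwise assume a link and hold an
 * exclusive flock on the lock file while it runs. Error handling omitted. */
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv) {
    /* argv[1] = lock file, argv[2] = real compiler, argv[3..] = its arguments */
    int is_compile = 0;
    for (int i = 3; i < argc; i++)
        if (strcmp(argv[i], "-c") == 0)
            is_compile = 1;
    if (!is_compile) {
        int fd = open(argv[1], O_CREAT | O_RDWR, 0666);
        flock(fd, LOCK_EX);   /* held across exec; released when the link exits */
    }
    execvp(argv[2], argv + 2);
    return 127;
}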

Is there a better way to do this sort of thing?

The x86 Memory Model

Often I’ve found myself wanting to point someone to a description of the x86’s memory model, but there wasn’t any that quite laid it out the way I wanted. So this is my take on how shared memory works on multiprocessor x86 systems. The guts of this description are adapted/copied from “A Better x86 Memory Model: x86-TSO” by Scott Owens, Susmit Sarkar, and Peter Sewell; this presentation strips away most of the math and presents it in a more operational style. Any mistakes are almost certainly mine and not theirs.

Components of the System:

There is a memory subsystem that supports the following operations: store, load, fence, lock, unlock. The memory subsystem contains the following:

  1. Memory: A map from addresses to values
  2. Write buffers: Per-processor lists of (address, value) pairs; these are pending writes, waiting to be sent to memory
  3. “The Lock”: Which processor holds the lock, or None, if it is not held. Roughly speaking, while the lock is held, only the processor that holds it can perform memory operations.

There is a set of processors that execute instructions in program order, dispatching commands to the memory subsystem when they need to do memory operations. Atomic instructions are implemented by taking “the lock”, doing whatever reads and writes are necessary, and then dropping “the lock”. We abstract away from this.

Definitions

A processor is “not blocked” if either the lock is unheld or it holds the lock.

Memory System Operation

Processors issue commands to the memory subsystem. The subsystem loops, processing commands; on each iteration it can pick the command issued by any one of the processors to execute (each processor has at most one outstanding command). Some commands may not be eligible to execute because their preconditions do not hold. A rough code sketch of rules 1–3 follows the list.

  1. If a processor p wants to read from address a and p is not blocked:
    a. If there are no pending writes to a in p’s write buffer, return the value from memory
    b. If there is a pending write to a in p’s write buffer, return the most recent value in the write buffer
  2. If a processor p wants to write value v to address a, add (a, v) to the back of p’s write buffer
  3. At any time, if a processor p is not blocked, the memory subsystem can remove the oldest entry (a, v) from p’s write buffer and update memory so that a maps to v
  4. If a processor p wants to issue a barrier
    a. If the barrier is an MFENCE, p’s write buffer must be empty
    b. If the barrier is an LFENCE/SFENCE, there are no preconditions; these are no-ops **
  5. If a processor p wants to lock the lock, the lock must not be held and p’s write buffer must be empty; the lock is then set to be held by p
  6. If a processor p wants to unlock the lock, the lock must be held by p and p’s write buffer must be empty; the lock is then set to None
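
Here is a rough sketch of rules 1–3 as code, purely for illustration: the fixed sizes and names are mine, fences and the lock and unlock commands are elided, and this is not from the x86-TSO paper.

/* Illustrative sketch of rules 1-3. Not real x86, just the abstract machine. */
#include <stdbool.h>

#define NPROCS 2
#define BUFSZ  16
#define MEMSZ  1024

struct wb_entry { int addr; int val; };

static int memory[MEMSZ];                       /* map from addresses to values */
static struct wb_entry buffer[NPROCS][BUFSZ];   /* per-processor write buffers (FIFO) */
static int buf_len[NPROCS];
static int lock_holder = -1;                    /* -1 means "the lock" is not held */

static bool not_blocked(int p) { return lock_holder == -1 || lock_holder == p; }

/* Rule 1 (precondition: not_blocked(p)): return the newest buffered write to
 * addr if there is one, otherwise the value in memory. */
int do_load(int p, int addr) {
    for (int i = buf_len[p] - 1; i >= 0; i--)
        if (buffer[p][i].addr == addr)
            return buffer[p][i].val;
    return memory[addr];
}

/* Rule 2: a store just appends to the back of p's write buffer. */
void do_store(int p, int addr, int val) {
    buffer[p][buf_len[p]++] = (struct wb_entry){ addr, val };
}

/* Rule 3: at any time, if p is not blocked, the subsystem may flush the oldest
 * entry of p's write buffer to memory. */
void flush_oldest(int p) {
    if (!not_blocked(p) || buf_len[p] == 0)
        return;
    memory[buffer[p][0].addr] = buffer[p][0].val;
    for (int i = 1; i < buf_len[p]; i++)
        buffer[p][i - 1] = buffer[p][i];
    buf_len[p]--;
}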

Remarks

So, the only funny business that can happen is that a load can happen before a prior store to a different location has been flushed from the write buffer into memory. This means that if CPU0 executes “x = 1; r0 = y” and CPU1 executes “y = 1; r1 = x”, with x and y both initially zero, we can get “r0 == r1 == 0”.
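
Here is that example as a (nondeterministic) C program. The plain accesses make it a data race as far as the C standard is concerned, so treat it as an illustration of what the hardware permits rather than as well-formed C.

/* Store-buffering litmus test: r0 == r1 == 0 is a permitted outcome on x86. */
#include <pthread.h>
#include <stdio.h>

int x, y, r0, r1;   /* globals, so initially zero */

void *cpu0(void *arg) { x = 1; r0 = y; return NULL; }
void *cpu1(void *arg) { y = 1; r1 = x; return NULL; }

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, cpu0, NULL);
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("r0 = %d, r1 = %d\n", r0, r1);
    return 0;
}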

The common intuition that atomic instructions act like there is an MFENCE before and after them is basically right; MFENCE requires the write buffer to empty before it can execute and so do lock and unlock.

x86 is a pleasure to compile atomics code for. The “release” and “acquire” operations in the C++11 memory model don’t require any fencing to work. Neither do the notions of “execution order” and “visibility order” in the RMC memory model that my advisor and I designed.

** The story about LFENCE/SFENCE is a little complicated. Some sources insist that they actually do things. The Cambridge model models them as no-ops. The guarantees that they are documented to provide are just true all the time, though. I think they are useful when using non-temporal memory accesses (which I’ve never done), but not in general.

 

Forcing memory barriers on other CPUs with mprotect(2)

I have something of an unfortunate fondness for indefensible hacks.

As I discussed in my last post, RCU is a synchronization mechanism that excels at protecting read-mostly data. It is a particularly useful technique in operating system kernels, because full control of the scheduler permits many fairly simple and very efficient implementations of RCU.

In userspace, the situation is trickier, but still manageable. Mathieu Desnoyers and Paul E. McKenney have built a Userspace RCU library that contains a number of different implementations of userspace RCU. For reasons I won’t get into, efficient read-side performance in userspace seems to depend on having a way for a writer to force all of the reader threads to issue a memory barrier. The URCU library has one version that does this using standard primitives: it sends signals to all other threads; in their signal handlers the other threads issue barriers and indicate that they have done so; the caller waits until every thread has. This is very heavyweight and inefficient, because it requires running all of the threads in the process, even those that aren’t currently executing! A thread that isn’t scheduled right now has no reason to execute a barrier: it will execute one as part of getting rescheduled. Mathieu Desnoyers attempted to address this by adding a membarrier() system call to Linux that would force barriers in all other running threads in the process; after more than a dozen posted patches to LKML and a lot of back and forth, it got silently dropped.

While pondering this dilemma I thought of another way to force other threads to issue a barrier: by modifying the page table in a way that would force an invalidation of the Translation Lookaside Buffer (TLB) that caches page table entries! This can be done pretty easily with mprotect or munmap.
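
A sketch of the idea (this is not the actual patch; it leans on the kernel broadcasting the TLB invalidation, via IPIs, to every CPU currently running one of the process’s threads, and the extent to which that is guaranteed is exactly the sort of detail the commit message below deals with):

/* Sketch only: force other CPUs running this process's threads to execute a
 * memory barrier by toggling the protection of a dummy page. Changing the
 * protection requires invalidating cached TLB entries for the page, and the
 * resulting shootdown IPIs imply barriers on the CPUs that receive them.
 * Error handling omitted. */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void *dummy_page;
static long page_size;

void remote_barrier_init(void) {
    page_size = sysconf(_SC_PAGESIZE);
    dummy_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

void force_remote_barriers(void) {
    memset(dummy_page, 1, 1);   /* make sure the page is actually mapped in */
    mprotect(dummy_page, page_size, PROT_READ);
    mprotect(dummy_page, page_size, PROT_READ | PROT_WRITE);
}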

Full details in the patch commit message.

Why We Fight

Why We Fight, or

Why Your Language Needs A (Good) Memory Model, or

The Tragedy Of memory_order_consume’s Unimplementability

This, one of the most terrifying technical documents I’ve ever read, is why we fight: https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt.

Background

For background, RCU is a mechanism used heavily in the Linux kernel for locking around read-mostly data structures; that is, data structures that are read frequently but fairly infrequently modified. It is a scheme that allows for blazingly fast read-side critical sections (no atomic operations, no memory barriers, not even any writing to cache lines that other CPUs may write to) at the expense of write-side critical sections being quite expensive.

The catch is that writers might be modifying the data structure as readers access it: writers are allowed to modify the data structure (often a linked list) as long as they do not free any removed memory until it is “safe”. Since writers can be modifying a data structure as readers are reading from it, without any synchronization between them, we are now in danger of running afoul of memory reordering. In particular, if a writer initializes some structure (say, a routing table entry) and adds it to an RCU-protected linked list, it is important that any reader that sees that the entry has been added to the list also sees the writes that initialized the entry! While this will always be the case on the well-behaved x86 processor, architectures like ARM and POWER don’t provide this guarantee.

The simple solution to make the memory order work out is to add barriers on both sides, on platforms where they are needed: on the write side, after initializing the object but before adding it to the list; on the read side, after reading a pointer from the list but before accessing its members (including the next pointer). This cost is totally acceptable on the write side, but is probably more than we are willing to pay on the read side. Fortunately, we have an out: essentially all architectures (except for the notoriously poorly behaved Alpha) will not reorder instructions that have a data dependency between them. This means that we can get away with only issuing a barrier on the write side and taking advantage of the data dependency on the read side (between loading a pointer to an entry and reading fields out of that entry). In Linux this is implemented with the macros “rcu_assign_pointer” (which issues a barrier if necessary, and then writes the pointer) on the write side and “rcu_dereference” (which reads the value and then issues a barrier on Alpha) on the read side.
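
As a sketch, here is what that looks like for a hypothetical routing-table entry. The struct and names are made up, rcu_assign_pointer and rcu_dereference are the kernel macros just described, and the rest of the RCU machinery (rcu_read_lock/rcu_read_unlock, write-side locking, deferred freeing) is omitted.

/* Hypothetical RCU-protected list; purely illustrative kernel-style code. */
struct route_entry {
    int dest;
    int gateway;
    struct route_entry *next;
};

struct route_entry *routes;   /* head of the RCU-protected list */

/* Writer: fully initialize the entry, then publish it. rcu_assign_pointer
 * issues the barrier (where one is needed) before the pointer store. */
void add_route(struct route_entry *e, int dest, int gateway) {
    e->dest = dest;
    e->gateway = gateway;
    e->next = routes;
    rcu_assign_pointer(routes, e);
}

/* Reader: rcu_dereference reads the pointer (with a barrier only on Alpha);
 * the data dependency from that load to the field reads is what keeps the
 * accesses ordered everywhere else. */
int lookup_route(int dest) {
    struct route_entry *e;
    for (e = rcu_dereference(routes); e != NULL; e = rcu_dereference(e->next))
        if (e->dest == dest)
            return e->gateway;
    return -1;
}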

There is a catch, though: the compiler. There is no guarantee that something that looks like a data dependency in your C source code will be compiled as a data dependency. The most obvious way to me that this could happen is by optimizing “r[i ^ i]” or the like into “r[0]”, but there are many other ways, some quite subtle. This document, linked above, is the Linux kernel team’s effort to list all of the ways a compiler might screw you when you are using rcu_dereference, so that you can avoid them.
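
To make the r[i ^ i] example concrete, here is a fragment in the same style (illustrative only; the names are made up):

extern int r[16];
extern int *shared_index;   /* published with rcu_assign_pointer elsewhere */

int reader(void) {
    int *p = rcu_dereference(shared_index);
    int i = *p;
    /* The programmer is counting on the dependency chain p -> i -> r[i ^ i] to
     * order the loads, but the compiler may rewrite r[i ^ i] as r[0], at which
     * point the last load no longer depends on i and can be reordered. */
    return r[i ^ i];
}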

This is no way to run a railway.

Language Memory Models

Programming by attempting to quantify over all possible optimizations a compiler might perform and avoiding them is a dangerous way to live. It’s easy to mess up, hard to educate people about, and fragile: compiler writers are feverishly working to invent new optimizations that will violate the blithe assumptions of kernel writers! The solution to this sort of problem is that the language needs to provide the set of concurrency primitives that are used as building blocks (so that the compiler can constrain its code transformations as needed) and a memory model describing how they work and how they interact with regular memory accesses (so that programmers can reason about their code). Hans Boehm makes this argument in the well-known paper Threads Cannot be Implemented as a Library.

One of the big new features of C++11 and C11 is a memory model which attempts to make precise what values can be read by threads in concurrent programs and to provide useful tools to programmers at various levels of abstraction and simplicity. It is complicated, and has a lot of moving parts, but overall it is definitely a step forward.

One place it falls short, however, is in its handling of “rcu_dereference” style code, as described above. One of the possible memory orders in C11 is “memory_order_consume”, which establishes an ordering relationship with all operations after it that are data dependent on it. There are two problems here: first, these operations deeply complicate the semantics; the C11 memory model relies heavily on a relation called “happens before” to determine what writes are visible to reads; with consume, this relation is no longer transitive. Yuck! Second, it seems to be nearly unimplementable; tracking down all the dependencies and maintaining them is difficult, and no compiler yet does it; clang and gcc both just emit barriers. So now we have a nasty semantics for our memory model and we’re still stuck trying to reason about all possible optimizations. (There is work being done to try to repair this situation; we will see how it turns out.)
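
For concreteness, here is roughly what consume is supposed to buy you, written with C11 atomics (the names are mine; as noted, current compilers just strengthen the consume load to an acquire):

#include <stdatomic.h>

struct node { int payload; };

_Atomic(struct node *) head;

void publish(struct node *n) {
    n->payload = 42;
    atomic_store_explicit(&head, n, memory_order_release);
}

int read_payload(void) {
    struct node *n = atomic_load_explicit(&head, memory_order_consume);
    /* The read of n->payload is data dependent on the consume load, so the
     * model promises it sees the writer's initialization without any fence. */
    return n ? n->payload : -1;
}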

Shameless Plug

My advisor, Karl Crary, and I are working on designing an alternate memory model (called RMC) for C and C++ based on explicitly specifying the execution and visibility constraints that you depend on. We have a paper on it and I gave a talk about it at POPL this year. The paper is mostly about the theory, but the talk tried to be more practical, and I’ll be posting more about RMC shortly. RMC is quite flexible: all of the C++11 model apart from consume can be implemented in terms of RMC (although that’s probably not the best way to use it), and consume-style operations are done in a more explicit and more implementable (and implemented!) way.

Doing whole tree analysis with Dehydra

Introduction

In this post I discuss how I used Dehydra to do analysis of the entire Tracemonkey tree.

Note

I talk briefly about some Tracemonkey things to motivate what I used Dehydra for, but a knowledge of Tracemonkey is not required to appreciate the main point of this post.

The problem

One of the optimizations in my Tracemonkey inline threading work (Bug 506182), which I will be posting about later, is “PC update elimination”. Doing this requires figuring out which functions out of a certain set can access the JavaScript virtual program counter (which is stored as a variable named “pc” in a class named “JSFrameRegs”). There are about 230 functions in this set, so doing it manually is impractical.

The solution

I used Dehydra to help solve this problem mechanically. Dehydra is a static analysis tool built on top of gcc. It allows the semantic information of a program to be queried with JavaScript scripts.

First steps

By providing a process_function() function in a Dehydra script, I can inspect the variables used and the functions called by every function. Determining whether the pc is used is as simple as seeing if any variable has the name “JSFrameRegs::pc”.

function process_function(f, statements) {
  for each (let v in iterate_vars(statements)) {
    if (v.name == "JSFrameRegs::pc") {
      print(f.name);
      break;
    }
  }
}

The catch

Unfortunately, this isn’t quite what we want. This will tell us which functions directly use the PC, but not which functions can indirectly use it through function calls. Figuring that out requires looking at all of the functions called by a given function. This is not straightforward with Dehydra, as functions frequently call functions declared in other files. Dehydra is a gcc plugin and is driven by the normal build system; thus, it works on a file-by-file basis. Normal builds output an object file for each source file and rely on the linker to stitch it all together. Likewise, to do a whole tree analysis we need to output per-file information and then link it together later.

Collating the data

I reworked my process_function() to determine both whether the PC is directly accessed and the set of functions called directly. This data is then printed with the dump() function (discussed below):

function process_function(f, statements) {
  usespc = false;
  calls = {};

  for each (let v in iterate_vars(statements)) {
    if (v.name == "JSFrameRegs::pc") {
      usespc = true;
    }
    if (v.isFunction) {
      calls[v.name] = true;
    }
  }
  dump(f.name, usespc, calls);
}

The remaining question is how to structure our output. Outputting it as data declarations for some programming language seems like an easy way to do it. Since I am more comfortable with Python than JavaScript, I output the data as Python code:

function dump(name, usespc, calls) {
  s = '@@@';
  s += '\t"' + name + '": ({';
  for (let f in calls) {
    s += '"' + f + '", ';
  }
  s += '}, ' + (usespc ? "True" : "False") + '),';
  print(s);
}

We create a dictionary mapping each function name to a pair of the set of functions it calls and a flag indicating whether it accesses the PC directly. When doing a build, this output will be intermixed with output from other parts of the build infrastructure, so we tag all of our output lines with ‘@@@’ so that a post-processing shell script can recognize and extract the relevant data.

The relevant bit of the shell script (which is linked below) is:

(echo "callgraph = {";
cd "$DIR" && make -s CXX="$CXX" 2>&1 | grep ": @@@" | cut -d@ -f4-;
echo "}")

This data gathering method could easily be modified to analyze different problems. For simplicity, I extracted only the relevant information and converted it directly to Python data structures. Another approach would be to dump all of the function information (probably in JSON), allowing a later analysis script access to everything.

The analysis

I wrote an analysis script in Python to process the data. The input is a large dictionary, mapping functions to the set of functions they call and whether they touch PC directly.

I viewed the problem in terms of graph theory. The mapping from functions to the functions they call is simply a directed graph: each function is a node, and there is an edge from a function to each function that it calls. In this graph, some nodes (those that touch the PC) are initially marked. We want to color red every node that has a path to one of the initially marked nodes.

Another way to state this is that a node is colored red if and only if it is either initially marked or has an edge to a red node. Doing the coloring is a simple problem. We simply compute the reverse graph (reversing all of the edges) and then perform a depth first search of the reverse graph starting from the marked nodes. Every node we see, we color red.

When doing the DFS, we keep track of the node from which we first reached each newly colored node, so that we can recover a path from it back to an initially marked node.

My implementation also supports providing a predicate to exclude certain nodes from the search, in order to investigate how fixing certain functions to not require the PC would change things.
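
The core of the coloring pass is simple enough to sketch. This is an illustration in C rather than the actual Python (which is in search.py, linked below), and it leaves out the parent tracking and the exclusion predicate:

/* Illustrative sketch of the coloring pass. Nodes are numbered 0..n-1;
 * redge[v] lists the callers of v, i.e. the edges of the reverse graph. */
#define MAX_NODES 1024
#define MAX_EDGES 64

int n;                                 /* number of functions */
int redge[MAX_NODES][MAX_EDGES];       /* reverse edges: callers of each node */
int redge_len[MAX_NODES];
int marked[MAX_NODES];                 /* touches the PC directly */
int red[MAX_NODES];                    /* can (transitively) reach a marked node */

static void dfs(int v) {
    if (red[v])
        return;
    red[v] = 1;
    for (int i = 0; i < redge_len[v]; i++)
        dfs(redge[v][i]);              /* every caller of a red node is red */
}

void color(void) {
    for (int v = 0; v < n; v++)
        if (marked[v])
            dfs(v);
}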

Caveats

While my method is useful, it is not perfect. It can produce both false positives and false negatives.

False negatives result from polymorphism in the form of virtual functions and function pointers. When a call to such a function is made, there needs to be an edge from the caller to every function that could be called. For virtual functions, this is every function overriding the virtual function. For function pointers, I think this is every function of the proper type that is used in a way other than a call at some point. It should be possible to address this problem, but at significantly increased complexity compared to what I have. Fortunately for me, Tracemonkey does not make much use of virtual functions and does not use function pointers in places where it matters for my analysis.

False positives are a little trickier, and can’t be worked around in general. Consider the following:

void foo(bool bar) {
    if (bar)
        accessPC();   // hypothetical helper that touches the PC
}
void baz() { foo(false); }

Calling baz() will not result in the PC being accessed, but my analysis will report that it can access the PC (since foo() can). Knowing whether a function can actually cause the PC to be accessed is isomorphic to the halting problem and thus undecidable.

The code

My code is available here. make_callgraph.sh expects to find g++ and Dehydra under ~/gcc-dehydra/, as suggested in the Dehydra installation instructions. make_callgraph.sh must be run with its output redirected to callgraph.py. search.py contains both the general search algorithm and code that uses it to analyze functions introduced in my inline threading patch.