Internals

Deep dive into the internals of libpas.

Introduction

This document describes how libpas works as of 247029@main, so a bit ahead of where WebKit was as of 246842@main. Libpas is a fast and memory-efficient memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it'll be used for comprehensive isoheaping of all malloc/new callsites in C/C++ programs.

Since WebKit 186504@main, we've been steadily enabling libpas as a replacement for WebKit's bmalloc and MetaAllocator. This has so far added up to a ~2% Speedometer2 speed-up and a ~8% memory improvement (on multiple memory benchmarks). Half of the speed-up comes from replacing the MetaAllocator, which was JavaScriptCore's old way of managing executable memory. Now, JSC uses libpas's jit_heap to manage executable memory. The other half of the speed-up comes from replacing everything that bmalloc provided -- the fastMalloc API, the Gigacage API, and the IsoHeap<> API. All of the memory improvement comes from replacing bmalloc (the MetaAllocator was already fairly memory-efficient).

This document is structured as follows. First I describe the goals of libpas; these are the things that a malloc-like API created out of libpas should be able to expose as fast and memory-efficient functions. Then I describe the coding style. Next I tell all about the design. Finally I talk about how libpas is tested.

Goals of Libpas

Libpas tries to be:

  • Fast. The goal is to beat bmalloc performance on single-threaded code. Bmalloc was previously the fastest known malloc for WebKit.
  • Scalable. Libpas is meant to scale well on multi-core devices.
  • Memory-efficient. The goal is to beat bmalloc memory usage across the board. Part of the strategy for memory efficiency is consistent use of first-fit allocation.
  • External metadata. Libpas never puts information about a free object inside that object. The metadata is always elsewhere. So, there's no way for a use-after-free to corrupt libpas's understanding of memory.
  • Multiple heap configurations. Not all programs want the same time-memory trade-off. Some malloc users have very bizarre requirements, like what JavaScriptCore does with its ExecutableAllocator. The goal is to support all kinds of special allocator needs simultaneously in one library.
  • Boatloads of heaps. Libpas was written with the dream of obviating the need for ownership type systems or other compiler approaches to fixing the type-safety of use-after-frees. This means that we need one heap per type, and to be 100% strict about it. So, libpas supports tons of heaps.
  • Type-awareness. Sometimes, malloc decisions require knowing what the type's size and alignment are, like when deciding how to split and coalesce memory. Libpas is designed to avoid type errors arising from the malloc's "skewed" reuse of memory.
  • Common free. Users of libpas isoheaps don't have to know the heap of an object when they free it. All objects should funnel into the same free function. One kind of exception to this requirement is stuff like ExecutableAllocator, which needs a malloc, but is fine with not calling a common free function.
  • Cages. WebKit uses virtual memory reservations called cages, where WebKit allocates the virtual memory up front and the malloc has to associate that memory with some heap. Libpas supports multiple kinds of cages.

Libpas Style

Libpas is written in C. Ultimately, I chose C because I felt that the language provides better support for extremely low-level code:

  • C++ is usually my favorite, because it makes code easier to write, but for libpas, I wanted something easier to read. It's easier to read C when auditing for subtle bugs, because there's nothing hidden. C doesn't have stuff like destructor invocations or operator overloading, which result in surprising effectfulness in otherwise innocent-looking code. Memory management code like libpas has to be read a lot, so C is better.
  • C makes casting between pointers and integers very simple with its style of cast operator. It feels weird to use the C cast operator in C++, so when I have to do a lot of uintptr_t'ing, I prefer C.

C lets you do most of what C++ can if you rely on always_inline. This wasn't always the case, but modern C compilers will meat-grind the code with repeated application of the following things:

  • Inlining any always_inline call except if it's recursive or the function uses some very weird features that libpas doesn't use (like goto pointer).
  • Copy-propagating the values from the callsite into the function that uses the value.

Consequently, passing a function pointer (or struct of function pointers), where the pointer points to an always_inline function and the callee is always_inline results in specialization akin to template monomorphization. This works to any depth; the compiler won't be satisfied until there are no more always_inline function calls. This fortuitous development in compilers allowed me to write very nice template code in C. Libpas achieves templates in C using config structs that contain function pointers -- sometimes to always_inline functions (when we want specialization and inlining) and sometimes to out-of-line functions (when we want specialization but not inlining). Additionally, the C template style allows us to have true polymorphic functions. Lots of libpas slow paths are huge and not at all hot. We don't want that code specialized for every config. Luckily, this works just fine in C templates -- those polymorphic functions just pass around a pointer to the config they are using, and dynamically load and call things in that config, almost exactly the same way that the specialized code would do. This saves a lot of code size versus C++ templates.
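
To make this concrete, here is a minimal sketch of the idiom, with hypothetical names rather than libpas's actual config structs (which carry many more fields):

#include <stddef.h>

#define ALWAYS_INLINE inline __attribute__((always_inline))

typedef struct {
    size_t min_align;
    size_t (*adjust_size)(size_t size);
} example_heap_config;

static ALWAYS_INLINE size_t round_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

static ALWAYS_INLINE size_t example_adjust_size(size_t size)
{
    return size < 16 ? 16 : size;
}

/* Passing the config by value into an always_inline function specializes it,
   much like instantiating a C++ template: the function pointer call below is
   inlined away once the config is a compile-time constant. */
static ALWAYS_INLINE size_t allocation_size(size_t size, example_heap_config config)
{
    return round_up(config.adjust_size(size), config.min_align);
}

#define EXAMPLE_HEAP_CONFIG ((example_heap_config){ \
        .min_align = 16, \
        .adjust_size = example_adjust_size \
    })

size_t example_allocation_size(size_t size)
{
    /* Compiles to straight-line code with no indirect calls. */
    return allocation_size(size, EXAMPLE_HEAP_CONFIG);
}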

Most of libpas is written in an object-oriented style. Structs are used to create either by-value objects or heap-allocated objects. It's useful to think of these as classes, but in a loose way, since there are many ways to do classes in C, and libpas uses whatever techniques are best on a per-class basis. But heap-allocated objects have a clear convention: for a class named foo, we would call the struct pas_foo, and for a method bar on foo, we would call the function pas_foo_bar and have the first parameter be pas_foo*. The function that creates instances of foo is called pas_foo_create (or pas_foo_create_blah_blah in case of overloading) and returns a pas_foo*. The function that destroys foo objects is called pas_foo_destroy and takes a pas_foo*.
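
Sketched as declarations, the convention for a hypothetical class foo looks like:

/* Hypothetical class foo, following the libpas convention. */
typedef struct pas_foo pas_foo;

pas_foo* pas_foo_create(void);      /* constructor; returns a pas_foo* */
void pas_foo_bar(pas_foo* foo);     /* method bar; pas_foo* is the first parameter */
void pas_foo_destroy(pas_foo* foo); /* destructor */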

Libpas classes are usually implemented in files called pas_foo.h (the header that defines the struct and a subset of functions), pas_foo_inlines.h (the header that defines inline functions of foo that require calling functions declared in headers that pas_foo.h can't include), and pas_foo.c (the implementations of foo functions that can be out-of-line).

Some libpas "classes" are singletons. The standard way of implementing a singleton in libpas is that there is really no struct, only global variables and functions that are declared in the header. See pas_page_malloc or pas_debug_heap for examples of singletons.

Not everything in libpas is a class. In cases where a bunch of not-class-like things can be grouped together in a way that makes sense, we usually do something like a singleton. In cases where a function can't easily be grouped together with some class, even a singleton, we name the file it's in after the function. There are lots of examples of this, like pas_deallocate or pas_get_page_base. Sometimes this gets fun, like pas_get_page_base_and_kind_for_small_other_in_fast_megapage.h.

Finally, libpas avoids abbreviations even more so than WebKit usually does. Functions that have a quirky meaning typically have a long name that tells the story. The point is to make it easy to appreciate the subtlety of the algorithm when reading the code. This is the kind of code where complex situations should look complex at any abstraction level.

Design of Libpas

Libpas is organized into roughly eleven areas:

  1. Heap configurations. This is the way that we tell libpas how to organize a heap. Heap configurations can control a lot. They can change obvious things like the minalign and page size, but also more crazy things, like how to find a page header given a page and vice-versa.
  2. The large heaps. This is a first-fit heap based on arrays, cartesian trees, and hashtables. The large heap has excellent type safety support and can be safely (though not efficiently) used for small objects.
  3. Metacircularity. Libpas uses malloc-like APIs internally for managing its state. These are provided by the so-called bootstrap, utility, and immortal heaps.
  4. The segregated heaps and TLCs (thread-local caches). Libpas has a super fast simple segregated storage slab allocator. It supports type safety and is the most commonly used kind of heap.
  5. The bitfit heaps. This is a fast and memory-efficient type-unsafe heap based on slabs and bitmaps.
  6. The scavenger. Libpas performs a bunch of periodic bookkeeping tasks in a scavenger thread. This includes, but is not limited to, returning memory to the OS.
  7. Megapages and page header tables. Libpas has multiple clever tricks for rapidly identifying which kind of heap an object belongs to. This includes an arithmetic hack called megapages and some lock-free hashtables.
  8. The enumerator. Libpas supports malloc heap enumeration APIs.
  9. The basic configuration template, used to create the bmalloc_heap API that is used as a replacement for all of bmalloc's functionality.
  10. The JIT heap config.
  11. The fast paths. The various heaps, TLCs, megapages and page header tables are glued together by fast paths provided for allocation, deallocation, and various utility functions.

Heap Configurations

The pas_heap_config struct defines all of the configurable behaviors of a libpas heap. This includes things like how the heap gets its memory, what size classes use segregated, bitfit, or large allocators, and a bunch of other things.

Heap configs are passed by-value to functions that are meant to be specialized and inlined. To support this, the convention for defining a heap config is that you create a macro (like BMALLOC_HEAP_CONFIG) that gives a heap config literal expression. So, a call like pas_get_allocation_size(ptr, BMALLOC_HEAP_CONFIG) will give you an optimized fast path for getting the allocation size of objects in bmalloc. This works because such fast paths are always_inline.

Heap configs are passed by-pointer to functions that are not meant to be specialized. To support this, all heap configs also have a global variable like bmalloc_heap_config, so we can do things like pas_large_heap_try_deallocate(base, &bmalloc_heap_config).

Heap configs can have up to two segregated page configs (config.small_segregated_config and config.medium_segregated_config) and up to three bitfit page configs (config.small_bitfit_config, config.medium_bitfit_config, and config.marge_bitfit_config). Any of the page configs can be disabled, though weird things might happen if the smallest ones are disabled (rather than disabling the bigger ones). Page configs (pas_segregated_page_config, pas_bitfit_page_config, and the common supertype, pas_page_base_config) get used in much the same way as heap configs -- either by-value for specialized and inlined functions or by-pointer for unspecialized functions.

Heap and page configs also support specialized-but-not-inlined functions. These are supported using additional function pointers in those configs that are filled in using macros -- so you don't need to fill them in explicitly when creating your own config, like BMALLOC_HEAP_CONFIG or JIT_HEAP_CONFIG. The macros fill them in to point at never_inline functions that call some specialized and inlined function with the config passed as a constant. This means for example that:

BMALLOC_HEAP_CONFIG.specialized_local_allocator_try_allocate_small_segregated_slow(...);

Is an out-of-line direct function call to the specialization of pas_local_allocator_try_allocate_small_segregated_slow. And this would be a virtual call to the same function:

pas_heap_config* config = ...;
config->specialized_local_allocator_try_allocate_small_segregated_slow(...);

Note that in many cases where you have a pas_heap_config, you are in specialized code and the heap config is a known constant at compile time, so then:

config.specialized_local_allocator_try_allocate_small_segregated_slow(...);

Is an out-of-line direct function call.

The Large Heaps

Libpas's large heaps serve multiple purposes:

  • Everything is bootstrapped on large heaps. When segregated and bitfit heaps allocate memory, they do so from some large heap.
  • Segregated and bitfit heaps have object size ceilings in the tens or hundreds of kilobytes. So, objects that are too large for the other heaps get allocated from large heaps.

Large heaps are broken into three parts:

  1. The large free heap. In libpas jargon, a free heap is a heap that requires that deallocation passes the object size, requires that the freed object size matches the allocated object size for that object, and makes no guarantees about what kind of mess you'll get yourself into if you fail to obey that rule.
  2. The large map. This maps object pointer to size and heap.
  3. The large heap. This is an abstraction over both (1) and (2).

Large free heaps just maintain a free-list; they know nothing about allocated objects. But combined with the large map, the large heaps provide a user-friendly deallocation API: you just need the object pointer, and the large map figures out the rest, including identifying which free heap the object should be deallocated into, and the size to pass to that free heap.
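
A conceptual sketch of that deallocation flow, with hypothetical names (the real path also takes the heap lock and talks to the large sharing pool):

#include <stdint.h>

typedef struct large_free_heap large_free_heap;
typedef struct {
    uintptr_t size;
    large_free_heap* heap;
} large_entry;

extern large_entry large_map_take(uintptr_t begin); /* hypothetical */
extern void large_free_heap_deallocate(large_free_heap* heap,
                                       uintptr_t begin, uintptr_t size); /* hypothetical */

static void large_heap_deallocate(uintptr_t begin)
{
    /* The large map knows the size and owning heap; the caller only needs
       the object pointer. */
    large_entry entry = large_map_take(begin);
    large_free_heap_deallocate(entry.heap, begin, entry.size);
}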

Large heaps operate under a single global lock, called the heap lock. Most libpas heaps use fine-grained locking or avoid locking entirely. But for the large heap, libpas currently just uses one lock.

Large Free Heap

Large free heaps are built out of a generic algorithm that doesn't know how to represent the free-list and gets instantiated with either of two free-list representations, simple and fast. The simple large free heap uses an unordered array of coalesced free object descriptions. The fast large free heap uses a cartesian tree of coalesced free object descriptions.

A free object description is represented by the pas_large_free object in libpas; let's just call it a large free for brevity. Large frees can tell you the beginning and end of a free chunk of memory. They can also tell if the memory is already known to be zero and what the type skew of the free memory is. The large heap can be used to manage arrays of some type that is either larger than the heap's minimum alignment, or that is smaller than and not a divisor of the alignment. Especially when this is combined with memalign, the free heap will have to track free memory that isn't type-aligned. Just consider a type of size 1000 that is allocated with alignment 16384. The rules of memalign say that the size must be 16384 in that case. Assuming that the free heap had 32768 contiguous bytes of free memory to begin with, it will now have 16384 bytes that starts with a type skew of 384 bytes. The type skew, or offset_in_type as libpas calls it, is the offset of the beginning of the large free inside the heap's type. In extremely complex cases, this means that finding where the first valid object address is inside a large free for some type and alignment requires computing the offset least common multiple (see pas_coalign), which relies on the right Bézout coefficient of the extended GCD of the type size and alignment (see pas_extended_gcd).
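
Conceptually, pas_coalign answers the question: given a large free whose start sits at some offset_in_type, what is the first address that both starts a type instance and satisfies the requested alignment? A brute-force illustration of that search (the real code solves it in closed form using pas_extended_gcd; this helper is hypothetical, not libpas code):

#include <stdbool.h>
#include <stdint.h>

static bool find_first_typed_aligned_address(
    uintptr_t begin, uintptr_t end, uintptr_t offset_in_type,
    uintptr_t type_size, uintptr_t alignment, uintptr_t* result)
{
    /* Walk aligned addresses, starting at the first one at or after begin. */
    for (uintptr_t address = (begin + alignment - 1) & ~(alignment - 1);
         address < end;
         address += alignment) {
        /* A valid object start sits at offset 0 within the type. */
        if (!((offset_in_type + (address - begin)) % type_size)) {
            *result = address;
            return true;
        }
    }
    return false;
}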

Large frees support an API for coalescing (merging, as libpas calls it) and splitting. The generic large free heap handles searching through large frees to find the first one that matches an allocation request for some size and alignment. It also handles coalescing freed memory back into the heap, by searching for adjacent free memory. The searches go through a struct of function pointers that may be implemented either simply (like the simple large free heap's O(n) search through an unordered array) or efficiently (like the O(1)-ish or O(log n)-ish operations on the cartesian tree in the fast large free heap). The generic algorithm uses the C template idiom, so there are no actual function pointer calls at runtime.

Large free heaps allow you to give them callbacks for allocating and deallocating memory. The allocation callback will get called if you ask a large free heap for memory and it doesn't have it. That allocation callback could get the memory from the OS, or it could get it from some other heap. The deallocation callback is for those cases where the large free heap called the allocation callback and then decided it wanted to give some fraction of that memory back. Both callbacks are optional (can be NULL), though the case of a NULL allocation callback and non-NULL deallocation callback is not useful since the deallocation callback only gets called on the path where we had an allocation callback.

Note that large free heaps do not do anything to decommit their free memory. All decommit of memory in large free heaps is accomplished by the large sharing pool, which is part of the scavenger.

Large Map

The large map is a hashtable that maps object addresses to large entries, which contain the size of the object and the heap it belongs to. The large map has a fairly complex hashtable algorithm because of my past attempts at making the large heap at least somewhat efficient even for small objects. But it's conceptually a simple part of the overall algorithm. It's also legal to ask the large map about objects it doesn't know about, in which case, like a normal hashtable, it will just tell you that it doesn't know about your object. Combined with the way that segregated and bitfit heaps use megapage tables and page header tables, this means that libpas can do fall-through to another malloc for objects that libpas doesn't manage.

Note that it might be OK to remove the small object optimizations in the large map. On the other hand, they are reliable, and they aren't known to increase the cost of the algorithm. Having that capability means that as part of tuning the algorithm, it's safer than it would otherwise be to try putting some small objects into the large heap to avoid allocating the data structures required for operating a segregated or bitfit heap.

The Large Heap

The large free heap and large map are combined into a high-level API with the pas_large_heap. In terms of state, this is just a pas_large_free_heap plus some data to help with small-object optimizations in the large map. The functions of the large heap do a lot of additional work:

  • They give the free heap an allocator for getting new memory. The large heap routes memory allocation requests to the heap config's allocation callback.
  • They ensure that each free heap allocation ends up in the large map.
  • They implement deallocation by removing something from the large map and then deallocating it into the free heap.
  • They provide integration with the scavenger's large sharing pool so that free memory can be decommitted.

The large heap is always used as a member of the pas_heap object. It's useful to think of pas_large_heap as never being a distinct object; it's more of a way of compartmentalizing pas_heap. The heap object also contains a segregated heap and some other stuff.

Metacircularity

I'm used to programming with dynamically allocated objects. This lets me build arrays, trees, hashtables, look-up tables, and all kinds of lock-free data structures. So, I wanted libpas's internals to be able to allocate objects just like any other kind of algorithm would do. But libpas is engineered so that it can be a "bottom of the world" malloc -- where it is the implementation of malloc and free and cannot rely on any memory allocation primitives other than what the kernel provides. So, libpas uses its own allocation primitives for its own objects that it uses to implement those primitives. This is bootstrapped as follows:

  • The bootstrap heap is a simple large free heap. A simple large free heap needs to be able to allocate exactly one variable-length array of large frees. The bootstrap heap has hacks to allow itself to allocate that array out of itself. This trick then gives us a complete malloc implementation for internal use by libpas, albeit one that is quite slow, can only be used under the heap lock, and requires us to know the object's size when freeing it. All other simple large free heaps allocate their free lists from the bootstrap heap. The bootstrap heap is the only heap in libpas that asks the OS for memory. All other heaps ask either the bootstrap heap for memory, or they ask one of the other heaps.
  • The compact reservation is 128MB of memory that libpas uses for objects that can be pointed at with 24-bit (3 byte) pointers assuming 8-byte alignment. Libpas needs to manage a lot of heaps, and that requires a lot of internal meta-data, and having compact pointers reduces the cost of doing this. The compact reservation is allocated from the bootstrap heap.
  • The immortal heap is a heap that bump-allocates out of the compact reservation. It's intended for small objects that are immortal.
  • The compact bootstrap heap is like the bootstrap heap, except that it allocates its memory from the compact reservation, and allocates its free list from the bootstrap heap rather than itself.
  • The compact large utility free heap is a fast large free heap that supports decommitting free memory (see the scavenger section) and allocates its memory from the compact bootstrap heap.
  • The utility heap is a segregated heap configured to be as simple as possible (no thread-local caches for example) and can only be used while holding the heap lock. It only supports objects up to some size (PAS_UTILITY_LOOKUP_SIZE_UPPER_BOUND), supports decommitting free memory, and gets its memory from the compact bootstrap heap. One example of how the utility heap gets used is the nodes in the cartesian trees used to implement fast large free heaps. So, for example, the compact large utility free heap relies on the utility heap.
  • The large utility free heap is a fast large free heap that supports decommitting free memory and allocates its memory from the bootstrap heap.

Note how the heaps pull memory from one another. Generally, a heap will not return memory to the heap it got memory from except to "undo" part of an allocation it had just done. So, this arrangement of who-pulls-memory-from-who is designed for type safety, memory efficiency, and elegantly supporting weird alignments:

  • Libpas uses decommit rather than unmapping free memory because this ensures that we don't ever change the type of memory after that memory gets its type for the first time.
  • The lower-level heaps (like bootstrap and compact bootstrap) do not support decommit. So, if a higher-level heap that does support decommit ever returned memory to the lower-level heap, then the memory would never get decommitted.
  • Page allocation APIs don't let us easily allocate with alignment greater than page size. Libpas does this by over-allocating (allocating size + alignment and then searching for the first aligned start within that larger reservation). This is all hidden inside the bootstrap heap; all other heaps that want memory at some weird alignment just ask some other heap for memory (often the bootstrap heap), and that heap, or ultimately the bootstrap heap, figures out what that means in terms of system calls.

One missing piece to the metacircularity is having a fast utility heap that uses thread-local caches. There is currently maybe one utility heap callsite that only grabs the heap lock just because it wants to allocate in the utility heap. There's a possibility of a small speed-up if any callsite like that used a fast utility heap instead, and then no locking would be required. It's not clear how easy that would be; it's possible that some bad hacks would be required to allow code that uses TLCs to call into a heap that then also uses TLCs.

Segregated Heaps and TLCs (Thread Local Caches)

Libpas's great performance is mostly due to the segregated heaps and how they leverage thread-local caches. TLCs provide a per-thread cache of global memory. In the best case, this cache prevents threads from doing any synchronization during allocation and deallocation. Even when they do have to do some synchronization, TLCs make it unlikely that one thread will ever want to acquire a lock held by another thread. The strategy is three-fold:

  • The TLC has a per-size-class allocator that caches some amount of that size class's memory. This means that the allocation fast path doesn't have to do any locking or atomic instructions except when its cache runs out. Then, it will have to do some synchronization -- in libpas's case, fine-grained locking and some lock-free algorithms -- to get more memory. The amount of memory each allocator can cache is bounded (usually 16KB) and allocators can only hold onto memory for about a second without using it before it gets returned (see the scavenger section).
  • The TLC has a deallocation log. The fast path of deallocating a segregated heap object is just pushing it onto the deallocation log without any locking. The slow path is to walk the log and free all of the objects. The libpas deallocation log flush algorithm cleverly avoids doing per-object locking; in the best case it will acquire a couple of locks before flushing and release them after flushing the whole log.
  • When the deallocation log flush frees memory, it tries to first make that memory available exclusively to the thread that freed it by putting the free memory into the local view cache for that size class in that thread. Memory moves from the local view caches into the global heap only if the view cache is full (has about 1.6MB of memory in it) or if it hasn't been used in about a second (see the scavenger section).

This section lays out the details of how this works. Segregated heaps are organized into segregated directories. Each segregated directory is an array of page views. Each page view may or may not have a page associated with it. A view can be exclusive (the view has the page to itself), partial (it's a view into a page shared by others), or shared (it represents a page shared by many partial views). Pages have a page boundary (the address of the beginning of the page) and a page header (the object describing the page, which may or may not actually be inside the page). Pages maintain alloc bits to tell which objects are live. Allocation uses the heap's lookup tables to find the right allocator index in the TLC, which yields a local allocator; that allocator usually has a cache of memory to allocate from. When it doesn't, it first tries to pop a view from the local view cache, and if that fails, it uses the find first eligible algorithm on the corresponding directory to find an eligible view. Once the allocator has a view, it ensures that the view has a page, and then scans the alloc bits to create a cache of free memory in that page. Deallocation fast paths just push the object onto the deallocation log. When the log is full, the TLC flushes its log while trying to amortize lock acquisition. Freeing an object in a page means clearing the corresponding alloc bit. Once enough alloc bits are clear, either the page's view ends up on the view cache, or the directory is notified to mark the page either eligible or empty. The sections that follow go into each of these concepts in detail.

The Thread Local Cache

Each thread has zero or one pas_thread_local_caches. Libpas provides slow paths for allocating and deallocating without a TLC in cases where TLC creation is forbidden (like when a thread is shutting down). But usually, allocation and deallocation create a TLC if the thread doesn't already have one. TLCs are structured as follows:

  • TLCs contain a fixed-size deallocation log along with an index that tells how much of the log is full. Deallocation pushes onto that log.
  • TLCs contain a variable-length array of allocators, which are really either local allocators or local view caches. Allocators are variable length. Clients access allocators using an allocator index, which they usually get from the directory that the allocator corresponds to.
  • TLCs can get reallocated and deallocated, but they always point to a pas_thread_local_cache_node, which is an immortal and compact descriptor of a TLC. TLC nodes are part of a global linked list. Each TLC node may or may not have a live TLC associated with it. TLCs cannot be created or destroyed unless the heap lock is held, so if you hold the heap lock, you can iterate the TLC node linked list to find all TLCs.
  • The layout of allocator indices in a TLC is controlled by both the directories and the TLC layout data structure (pas_thread_local_cache_layout). This is a global data structure that can tell us about all of the allocators in a TLC. When holding the heap lock, it's possible to loop over the TLC layout linked list to find what all of the valid allocator indices are and to introspect what is at those indices.

Thread local caches tend to get large because both the local allocators and local view caches have inline arrays. The local allocator has an array of bits that tell the allocator where the free objects are. The local view cache has an array of view pointers (up to 1.6MB / 16KB = 100 entries, each using a 24-bit pointer). When used in single-heap applications, these overheads don't matter -- they end up accounting for less than 10^-5 of the overall process footprint (not just in WebKit but when I experimented with libpas in daemons). But when used for many heaps, these overheads are substantial. Given thousands or tens of thousands of heaps, TLCs account for as much as 1% of memory. So, TLCs support partial decommit. Those pages that only have allocators that are inactive get decommitted. Note that TLC decommit has landed in the libpas.git repo as of 247029@main, but hasn't yet been merged into WebKit.

The TLC deallocation log flush algorithm is designed to achieve two performance optimizations:

  • It achieves temporal locality of accesses to page headers. If freeing each object meant flipping a bit in the page header, then many of those operations would miss cache, since the page header is not accessed by normal program operation -- it's only accessed during some allocation slow paths and when deallocating. But because deallocation accesses page headers only during log flush and log flush touches about 1000 objects, it's likely that the flush will touch the same page header cache lines multiple times.
  • It reduces the average number of lock acquisitions needed to free an object. Each page uses its own lock to protect its page header and the page header's alloc_bits. But deallocation log flush will do no locking or unlocking if the object at index i and the object at index i+1 use the same lock. Pages can dynamically select which lock they use (thanks to page->lock_ptr), and they select it so that pages allocated out of by a thread tend to share the same lock, so deallocation log flush usually acquires a lock just once at the start of the 1000 objects and releases it when it finishes the 1000 objects. A sketch of this flush loop follows this list.
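
Here is a sketch of that flush loop, assuming hypothetical helpers (page_for_object, free_object_in_page) and a schematic pas_lock-style locking API:

#include <stddef.h>
#include <stdint.h>

typedef struct pas_lock pas_lock;
extern void pas_lock_lock(pas_lock* lock);
extern void pas_lock_unlock(pas_lock* lock);

typedef struct { pas_lock* lock_ptr; } page;
extern page* page_for_object(uintptr_t object);            /* hypothetical */
extern void free_object_in_page(page* the_page, uintptr_t object); /* hypothetical */

static void flush_deallocation_log(uintptr_t* log, size_t log_size)
{
    pas_lock* held_lock = NULL;
    for (size_t index = 0; index < log_size; index++) {
        page* the_page = page_for_object(log[index]);
        pas_lock* lock = the_page->lock_ptr;
        if (lock != held_lock) {
            /* Uncommon case: consecutive pages usually share one lock. */
            if (held_lock)
                pas_lock_unlock(held_lock);
            pas_lock_lock(lock);
            held_lock = lock;
            /* The real code re-checks lock_ptr here, since pages can be
               re-pointed at another lock (see the locking dance later). */
        }
        free_object_in_page(the_page, log[index]);
    }
    if (held_lock)
        pas_lock_unlock(held_lock);
}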

The Segregated Heap

The pas_segregated_heap object is the part of pas_heap that facilitates segregated heap and bitfit heap allocation. The details of the bitfit heap are discussed in a later section. Segregated heaps can be created separately from a pas_heap, but segregated heaps are almost always part of a pas_heap (and it would be easy to refactor libpas to make it so that segregated heaps are always part of heaps).

Segregated heaps use size directories to track actual memory. Most of the exciting action of allocation and deallocation happens in directories. Each directory corresponds to some size class. Segregated heaps make it easy to find the directory for a particular size. They also make it easy to iterate over all the directories. Also, segregated heaps make it easy to find the allocator index for a particular size (using a lookup table that is essentially a cache of what you would get if you asked for the directory using the size-to-directory lookup tables and then asked the directory for the allocator index). The most exciting part of the segregated heap algorithm is pas_segregated_heap_ensure_size_directory_for_size, which decides what to do about allocating a size it hadn't encountered before. This algorithm will either return an existing directory, create a new one, or even retire an existing one. It handles all of the issues related to type size, type alignment, and the alignment argument to the current malloc call.

The lookup tables maintained by segregated heaps have some interesting properties:

  • They can be decommitted and rematerialized. This is a useful space saving when having lots of isoheaps. The rematerialization happens because a heap also maintains a linked list of directories, and that linked list never goes away. Each directory in the linked list knows what its representation would have been in the lookup tables.
  • They are optional. Some heaps can be configured to have a preferred size, called the basic size class. This is very common for isoheaps, which may only ever allocate a single size. For isoheaps based on type, the basic size class is just that type's size. Other isoheaps dynamically infer a preferred size based on the first allocation. When a heap only has the basic size class, it will have no lookup tables.
  • There are separate lookup tables for smaller sizes (not related to the small_segregated_config -- the threshold is set separately); these are just arrays indexed by the size divided by the heap's minalign, rounded up. These may be populated under heap lock while they are accessed without any locks. So, accesses to them have some guards against races.
  • There are separate lookup tables for medium sizes (anything above the threshold for small lookup tables). The medium table is a sorted array that the allocator binary-searches. It may be mutated, decommitted, rematerialized, or reallocated under heap lock. The algorithm defends itself against this with a bunch of compiler fences and a mutation count check (see the sketch after this list). Mutating the table means incrementing-to-mutate before making changes and incrementing-to-finish after making changes. So, the algorithm for lock-free lookup checks the mutation count before and after, and makes sure they are the same and neither indicates that we are mutating. This involves clever use of dependency threading (like the ARM64 eor-self trick) to make sure that the mutation count reads really happen before and after the binary search.
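
A simplified sketch of that lock-free lookup protocol; the table and binary_search here are hypothetical, and the real code additionally threads a dependency through the count reads rather than relying on fences alone:

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned mutation_count; /* odd while a mutation is in progress */
    /* ...sorted array of size classes... */
} medium_table;

extern void pas_compiler_fence(void); /* schematic declaration of the real libpas fence */
extern bool binary_search(medium_table* table, size_t size, size_t* result); /* hypothetical */

static bool try_lookup(medium_table* table, size_t size, size_t* result)
{
    unsigned count_before = table->mutation_count;
    pas_compiler_fence();
    if (count_before & 1)
        return false; /* mutation in progress; take the heap lock instead */
    bool found = binary_search(table, size, result);
    pas_compiler_fence();
    if (table->mutation_count != count_before)
        return false; /* the table changed under us; retry on the slow path */
    return found;
}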

Segregated Directories

Much of the action of managing memory in a segregated heap happens in the segregated directories. There are two kinds:

  • Segregated size directories, which track the views belonging to some size class in some heap. These may be exclusive views, which own a page, or partial views, which own part of a shared page. Partial views range in size from just under 512 bytes up to a whole page (in rare cases).
  • Segregated shared page directories, which track shared views. Each shared view tracks its shared page and which partial views belong to it. However, when a shared page is decommitted, to save space, the shared view will forget which partial views belong to it; they will re-register themselves the first time someone allocates in them.

Both of them rely on the same basic state, though they use it a bit differently:

  • A lock-free-access vector of compact view pointers. These are 4-byte pointers. This is possible because views are always allocated out of the compact reservation (they are usually allocated in the immortal heap). This vector may be appended to, but existing entries are immutable. So, resizing just avoids deleting the smaller-sized vectors so that they may still be accessed in case of a race.
  • A lock-free-access segmented vector of bitvectors. There are two bitvectors, and we interleave their 32-bit words of bits. The eligible bitvector tells us which views may be allocated out of. This means different things for size directories than shared page directories. For size directories, these are the views that have some free memory and nobody is currently doing anything with them. For shared page directories, these are the shared pages that haven't yet been fully claimed by partial views. The empty bitvector tells us which pages are fully empty and can be decommitted. It's never set for partial views. It means the same thing for both exclusive views in size directories and shared views in shared page directories.

Both bitvectors are searched in order:

  • Eligible bitvectors are searched first-fit.
  • Empty bitvectors are searched last-fit.

Searches are made fast because the directory uses the lock-free tricks of pas_versioned_field to maintain two indices:

  • A first-eligible index. This always points to the first eligible bit, except in cases where some thread has set the bit but hasn't gotten around to setting the first-eligible index. In other words, this may have some lag, but the lag is bounded.
  • A last-empty-plus-one index. This always points to the index right after the last empty bit. If it's zero, it means there are no set empty bits. If it's the number of views, then it means the last view is empty for sure and there may be any number of other empty views.

These versioned indices can be read without any atomic instructions in many cases, though most mutations to them require a pair of 128-bit compare-and-swaps.
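
A sketch of what such a versioned field looks like, assuming C11 atomics and a target with a double-word compare-and-swap (e.g. cmpxchg16b); pas_versioned_field differs in its details:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t value;
    uintptr_t version;
} versioned_field;

static bool versioned_field_try_write(_Atomic versioned_field* field,
                                      versioned_field expected,
                                      uintptr_t new_value)
{
    versioned_field desired = { new_value, expected.version + 1 };
    /* The version bump is what lets readers detect that the value moved,
       even if it later returns to the same number (no ABA). */
    return atomic_compare_exchange_weak(field, &expected, desired);
}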

The eligible bitvector together with the first-eligible index allow for very fast searches to find the first eligible view. Bitvector searches are fast to begin with, even over a segmented vector, since the segmented vector has large-enough chunks. Even searching the whole bitvector is quite efficient because of the properties of bitvector SIMD (i.e. using a 32-bit or 64-bit or whatever-bit word to hold that many bits). But the first-eligible index means that most searches never go past where that index points, so we get a mostly-O(1) behavior when we do have to find the first eligible view.

The empty bitvector gives a similar property for the scavenger, which searches backwards to find empty views. The efficiency here arises from the fact that empty pages know the timestamp of when they became empty, and the scavenger will terminate its backwards search when it finds a too-recently emptied page.

Directories have to make some choices about how to add views. View addition happens under heap lock and must be in this order:

  • First we make sure that the bitvectors have enough room for a bit at the new index. The algorithm relies on the size of the view vector telling us how many views there are, so it's fine for the bitvectors to be too big for a moment. The segmented vector algorithm used for the bitvectors requires appending to happen under heap lock, but it can run concurrently to accesses to the vector. It accomplishes this by never deallocating the too-small vector spines.
  • Then we append the view to the view vector, possibly reallocating the view vector. Reallocation keeps the old too-small copy of the vector around, allowing concurrent reads to the vector. The vector append stores the value, executes an acqrel fence (probably overkill -- probably could just be a store fence), and then increments the size, as sketched below. This ensures nobody sees the view until we are ready.
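
The append protocol from the second step, sketched with hypothetical names; the caller holds the heap lock, and lock-free readers only ever trust the first size entries, so the entry must be visible before the size bump:

#include <stdint.h>

typedef struct {
    uint32_t* entries; /* compact (4-byte) view pointers */
    unsigned size;
    unsigned capacity;
} view_vector;

extern void grow_keeping_old_copy(view_vector* vector); /* hypothetical */
extern void pas_fence(void); /* schematic declaration, in the style of libpas fences */

static void view_vector_append(view_vector* vector, uint32_t view)
{
    if (vector->size == vector->capacity)
        grow_keeping_old_copy(vector); /* the old copy stays readable for racers */
    vector->entries[vector->size] = view;
    pas_fence(); /* publish the entry before incrementing the size */
    vector->size++;
}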

Directories just choose what kind of view they will create and then create an empty form of that view. So, right at the point where the vector append happens, the view will report itself as not yet being initialized. However, any thread can initialize an empty view. The normal flow of allocation means asking a view to "start allocating". This actually happens in two steps (will_start_allocating and did_start_allocating). The will-start step checks if the view needs to commit its memory, which will cause empty exclusive views to allocate a page. Empty partial views get put into the partial primordial state where they grab their first chunk of memory from some shared view and prepare to possibly grab more chunks of that shared view, depending on demand. But all of this happens after the directory has created the view and appended it. This means that there is even the possibility that one thread creates a view, but then some other thread takes it right after it was appended. In that case, the first thread will loop around and try again, maybe finding some other view that had been made eligible in the meantime, or appending another new view.

Size directories maintain additional state to make page management easy and to accelerate allocation.

Size directories that have enabled exclusive views have a full_alloc_bits vector that has bits set for those indices in their pages where an object might start. Pages use bitvectors indexed by minalign, and only set those bits that correspond to valid object offsets. The full_alloc_bits vector is the main way that directories tell where objects could possibly be in the page. The other way they tell is with offset_from_page_boundary_to_first_object and offset_from_page_boundary_to_end_of_last_object, but the algorithm relies on those a bit less.

Size directories can tell if they have been assigned an allocator index or a view cache index, and control the policies of when they get them. A directory without an allocator index will allocate out of baseline allocators, which are shared by all threads. Having an allocator index implies that the allocator index has also been stored in the right places in the heap's lookup tables. Having a view cache index means that deallocation will put eligible pages on the view cache before marking them eligible in the directory.

Page Boundaries, Headers and Alloc Bits

"Pages" in the segregated heap are a configurable concept. Things like their size and where their header lives can be configured by the pas_segregated_page_config of their directory. The config can tell which parts of the page are usable as object payload, and they can provide callbacks for finding the page header.

The page header contains:

  • The page's kind. The segregated page header is a subtype of the pas_page_base, and we support safely downcasting from pas_page_base* to pas_segregated_page*. Having the kind helps with this.
  • Whether the page is in use for allocation right now. This field is only for exclusive views. Allocation in exclusive pages can only happen when some local allocator claims a page. For shared pages, this bit is in each of the partial views.
  • Whether we have freed an object in the page while also allocating in it. This field is only for exclusive views. When we finish allocating, and the bit is set, we do the eligibility stuff we would have done if we had freed objects without the page being used for allocation. For shared pages, this bit is in each of the partial views.
  • The sizes of objects that the page manages. Again, this is only used for exclusive views. For shared pages, each partial view may have a different size directory, and the size directory tells the object size. It's also possible to get the object size by asking the exclusive view for its size directory, and you will get the same answer as if you had asked the page in that case.
  • A pointer to the lock that the page uses. Pages are locked using a kind of locking dance (see the sketch after this list): you load the page->lock_ptr, lock that lock, and then check if page->lock_ptr still points at the lock you tried. Anyone who holds both the current lock and some other lock can change page->lock_ptr to the other lock. For shared pages, the lock_ptr always points at the shared view's ownership lock. For exclusive views, the libpas allocator will change the lock of a page to be the lock associated with its TLC. If contention on page->lock_ptr happens, then we change the lock back to the view's ownership lock. This means that in the common case, flushing the deallocation log will encounter page after page that wants to hold the same lock -- usually the TLC lock. This allows the deallocation log flush to only do a handful of lock acquisitions for deallocating thousands of objects.
  • The timestamp of when the page became empty, using a unit of time libpas calls epoch.
  • The view that owns the page. This is either an exclusive view or a shared handle, which is the part of the shared view that gets deallocated for decommitted pages. Note: an obvious improvement is if shared handles were actually part of the page header; they aren't only because until recently, the page header size had to be the same for exclusive and shared pages.
  • View cache index, if the directory enabled view caching. This allows deallocation to quickly find out which view cache to use.
  • The alloc bits.
  • The number of 32-bit alloc bit words that are not empty.
  • Optionally, the granule use counts. It's possible for the page config to say that the page size is larger than the system page size, but that the page is divided up into granules which are system page size. In that case, the page header will have an array of 1-byte use counts per granule, which count the number of objects in that granule. They also track a special state when the granule is decommitted. The medium_segregated_config uses this to offer fine-grained decommit of 128KB "pages".
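
The locking dance mentioned above, sketched with schematic names:

#include <stddef.h>

typedef struct pas_lock pas_lock;
extern void pas_lock_lock(pas_lock* lock);
extern void pas_lock_unlock(pas_lock* lock);

typedef struct { pas_lock* lock_ptr; } page_header;

static pas_lock* lock_page(page_header* page)
{
    for (;;) {
        pas_lock* lock = page->lock_ptr;
        pas_lock_lock(lock);
        if (page->lock_ptr == lock)
            return lock; /* we hold the lock this page currently uses */
        /* The page was re-pointed at another lock while we were acquiring;
           unlock and retry with the new one. */
        pas_lock_unlock(lock);
    }
}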

Currently we have two ways of placing the page header: either at the beginning of the page (what we call the page boundary), or in an object allocated in the utility heap. In the latter case, we use the mostly-lock-free page header table to map between the page boundary and page header, or vice-versa. The page config has callbacks that allow either approach. I've also used page config hacking to attempt other kinds of strategies, like saying that every aligned 16MB chunk of pages has an array of page headers at the start of it; but those weren't any better than either of the two current approaches.

The most important part of the page header is the alloc bits array and the num_non_empty_words counter. This is where most of the action of allocating and deallocating happens. The magic of the algorithm arises from the simple bitvector operations we can perform on page->alloc_bits, full_alloc_bits (from the size directory in case of exclusive pages or from the partial view in case of shared pages), and the allocator->bits. These operations allow us to achieve most of the algorithm:

  • Deallocation clears a bit in page->alloc_bits and if this results in the word becoming zero, it decrements the num_non_empty_words. The bit index is just the object's offset shifted by the page config's min_align_shift, which is a compile-time constant in most of the algorithm. If the algorithm makes any bit (for partials) or any word (for exclusives) empty, it makes the page eligible (either by putting it on a view cache or marking the view eligible in its owning directory). If num_non_empty_words hits zero, the deallocator also makes the view empty.
  • Allocation does a find-first-set-bit on the allocator->bits, but in a very efficient way, because the current 64-bit word of bits that the allocator is on is cached in allocator->current_word -- so allocation rarely searches an array. So, the allocator just loads the current word, does a ctz or clz kind of operation (which is super cheap on modern CPUs), left-shifts the result by the page config's minalign shift, and adds the allocator->page_ish (the address in memory corresponding to the first bit in current_word). That's the allocator fast path, sketched after this list.
  • We prepare allocator->bits by basically saying allocator->bits = full_alloc_bits & ~page->alloc_bits. This is a loop, since each of these bitvectors is an array of same-sized words. For a 16384-size page (the default for small_segregated_config) and a minalign shift of 4 (so minalign = 16, the default for small_segregated_config), this means 1024 bits, or 32 32-bit words, or 128 bytes. The loop over the 32 32-bit words is usually fully unrolled by the compiler. There are no loop-carried dependencies. This loop shows up in profiles, and though I've tried to make it faster, I've never succeeded.
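
The fast path from the second bullet, sketched with field names close to, but not exactly, libpas's local allocator:

#include <stdint.h>

typedef struct {
    uint64_t current_word; /* cached word of allocator bits */
    uintptr_t page_ish;    /* address corresponding to bit 0 of current_word */
} allocator_sketch;

static inline uintptr_t try_allocate(allocator_sketch* allocator,
                                     uintptr_t min_align_shift)
{
    uint64_t word = allocator->current_word;
    if (!word)
        return 0; /* slow path: refill the bits or find another page */
    uintptr_t index = (uintptr_t)__builtin_ctzll(word); /* find first set bit */
    allocator->current_word = word & (word - 1); /* clear the bit we claimed */
    return allocator->page_ish + (index << min_align_shift);
}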

Local Allocators

Each size directory can choose to use either baseline allocators or TLC local allocators for allocation. Each size directory can choose to have a local view cache or not. Baseline allocators are just local allocators that are global rather than part of any TLC; allocation needs to grab a lock to use them. TLC local allocators don't require any locking to get accessed.

Local allocators can be in any of these modes:

  • They are totally uninitialized. All fast paths fail and slow paths will initialize the local allocator by asking the TLC layout. This state happens if TLC decommit causes a local allocator to become all zero.
  • They are in bump allocation mode. Bump allocation happens either when a local allocator decides to allocate in a totally empty exclusive page, or for primordial partial allocation. In the former case, it's worth about 1% performance to sometimes bump-allocate. In the latter case, using bump allocation is just convenient -- the slow path will decide that the partial view should get a certain range of memory within a shared page and it knows that this memory has never been used before, so it's natural to just set up a bump range over that memory.
  • They are in free bits mode. This is slightly more common than the bump mode. In this mode, the allocator->bits is computed using full_alloc_bits & ~page->alloc_bits and contains a bit for the start of every free object.
  • They are in bitfit mode. In this mode, the allocator just forwards allocations to the pas_bitfit_allocator.

Local allocators can be stopped at any time; this causes them to just return all of their free memory back to the heap.

Local allocators in a TLC can be used without any conventional locking. However, there is still synchronization taking place because the scavenger is allowed to stop allocators. To support this, local allocators set an in_use bit (not atomically, but protected by a pas_compiler_fence) before they do any work and clear it when done. The scavenger thread will suspend threads that have TLCs and then while the TLC is suspended, they can stop any allocators that are not in_use.
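
A sketch of that in_use protocol; plain stores bracketed by compiler fences suffice because the scavenger suspends the thread before it inspects the allocator. The allocator state and do_allocate are hypothetical:

#include <stddef.h>

typedef struct { unsigned in_use; } allocator_state;
extern void pas_compiler_fence(void); /* schematic declaration; constrains the compiler, not the CPU */
extern void* do_allocate(allocator_state* allocator, size_t size); /* hypothetical */

static void* allocate_guarded(allocator_state* allocator, size_t size)
{
    allocator->in_use = 1; /* plain store; the scavenger suspends us before reading it */
    pas_compiler_fence();
    void* result = do_allocate(allocator, size);
    pas_compiler_fence();
    allocator->in_use = 0;
    return result;
}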

The Bitfit Heaps

Libpas is usually used with a combination of segregated and large heaps. However, sometimes we want to have a heap that is more space-efficient than segregated but not quite as slow as large. The bitfit heap follows a similar style to segregated, but:

  • While bitfit has a bit for each minalign index like segregated, bitfit actually uses all of the bits. To allocate an object in bitfit, all of the bits corresponding to all of the minalign granules that the object would use have to be free before the allocation and have to be marked as not free after the allocation. Freeing has to clear all of the bits.
  • The same page can have objects of any size allocated in it. For example, if a 100 byte object gets freed, then it's legal to allocate two 50 byte objects out of the freed space (assuming 50 is a multiple of minalign).
  • A bitfit directory does not represent a size class. A bitfit heap has one directory per bitfit page config and each page config supports a large range of sizes (the largest object is ~250 times larger than the smallest).

Bitfit pages have free_bits as well as object_end_bits. The free_bits indicates every minalign granule that is free. For non-free granules, the object_end_bits has an entry for every granule that is the last granule in some live object. These bits get used as follows:

  • To allocate, we find the first set free bit and then find the first clear free bit after that. If this range is big enough for the allocation, we clear all of the free bits and set the object end bit. If it's not big enough, we keep searching. We do special things (see below) when we cannot allocate in a page.
  • To free, we find the first set object end bit at or after the object's start, which then gives us the object size. Then we clear the object end bit and set the free bits (see the sketch below).
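
A simplified sketch of the free operation on these bitvectors, assuming hypothetical bit helpers; real pages operate on arrays of words:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t* free_bits;       /* one bit per minalign granule that is free */
    uint32_t* object_end_bits; /* one bit per granule that ends a live object */
    uintptr_t min_align_shift;
} bitfit_page_sketch;

extern uintptr_t find_next_set_bit(uint32_t* bits, uintptr_t start);        /* hypothetical */
extern void clear_bit(uint32_t* bits, uintptr_t index);                     /* hypothetical */
extern void set_bit_range(uint32_t* bits, uintptr_t begin, uintptr_t end);  /* hypothetical */

static size_t bitfit_free(bitfit_page_sketch* page, uintptr_t begin_granule)
{
    /* The first set object-end bit at or after the start tells us where the
       object ends, hence its size. */
    uintptr_t end_granule = find_next_set_bit(page->object_end_bits, begin_granule);
    clear_bit(page->object_end_bits, end_granule);
    set_bit_range(page->free_bits, begin_granule, end_granule + 1);
    return (end_granule + 1 - begin_granule) << page->min_align_shift;
}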

This basic style of allocation is usually called bitmap allocation. Bitfit is a special kind of bitmap allocation that makes it cheap to find the first page that has enough space for an allocation of a given size. Bitfit makes allocation fast even when it is managing lots of pages by using two tricks:

  • Bitfit directories have an array of bitfit views and a corresponding array of max_free bytes. Bitfit views are monomorphic, unlike the polymorphic views of segregated heaps. Each bitfit view is either uninitialized or has a bitfit page. A max_free byte for a page tells the maximum free object size in that page. So, in the worst case, we search the max_free vector to find the first byte that is large enough for our allocation.
  • Bitfit uses size classes to short-circuit the search. Bitfit leverages segregated heaps to create size classes. Segregated size directories choose at creation time if they want to support segregated allocation or bitfit allocation. If the latter, the directory is just used as a way of locating the bitfit size class. Like with segregated, each local allocator is associated with a segregated size directory, even if it's a local allocator configured for bitfit. Each size class maintains the index of the first view/page in the directory that has a free object big enough for that size class.

The updates to max_free and the short-circuiting indices in size classes happen when an allocation fails in a page. This is an ideal time to set those indices since failure to allocate happens to also tell you the size of the largest free object in the page.

When any object is freed in a page, we mark the page as having PAS_BITFIT_MAX_FREE_UNPROCESSED and rather than setting any short-circuiting indices in size classes, we just set the first_unprocessed_free index in the pas_bitfit_directory. Allocation will start its search from the minimum of directory->first_unprocessed_free and size_class->first_free. All of these short-circuiting indices use pas_versioned_field just like how short-circuiting works in segregated directories.

Bitfit heaps use fine-grained locking in the sense that each view has its own locks. But, there's no attempt made to have different threads avoid allocating in the same pages. Adding something like view caches to the bitfit heap is likely to make it much faster. You could even imagine that rather than having the directory->first_unprocessed_free, we instead have freeing in a page put the page's view onto a local view cache for that bitfit directory, and then we allocate out of the view cache until it's empty. Failure to allocate in a page in the view cache will then tell us the max_free, which will allow us to mark the view eligible in the directory.

The Scavenger

Libpas returns memory to the OS by madvising it. This makes sense for a malloc that is trying to give strong type guarantees. If we unmapped memory, then the memory could be used for some totally unrelated type in the future. But by just decommitting the memory, we get the memory savings from free pages of memory and we also get to preserve type safety.

The madvise system call -- and likely any mechanism for saying "this page is empty" -- is expensive enough that it doesn't make sense to do it anytime a page becomes empty. Pages often become empty only to refill again. In fact, last time I measured it, just about half of allocations went down the bump allocation path and (except for at start-up) that path is for completely empty pages. So, libpas has mechanisms for stashing the information that a page has become empty and then having a scavenger thread return that memory to the OS with an madvise call (or whatever mechanism). The scavenger thread is by default configured to run every 100ms, but will shut itself down if we have a period of non-use. At each tick, it returns all empty pages that have been empty for some length of time (currently 300ms). Those two thresholds -- the period and the target age for decommit -- are independently configurable, and it might make sense (for different reasons) to have either number be bigger than the other number.

This section describes the scavenging algorithm in detail. This is a large fraction of what makes libpas fast and space-efficient. The algorithm has some crazy things in it that probably didn't work out as well as I wanted but nonetheless those things seem to avoid showing up in profiles. That's sort of the outcome of this algorithm being tuned and twisted so many times during the development of this allocator. First I'll describe the deferred decommit log, which is how we coalesce madvise calls. Then I'll describe the page sharing pool, which is a mechanism for multiple participants to report that they have some empty pages. Then I'll describe how the large heap implements this with the large sharing pool, which is one of the singleton participants. Then I'll describe the segregated directory participants -- which are a bit different for shared page directories versus size directories. I'll also describe the bitfit directory participants, which are quite close to their segregated directory cousins. Then I'll describe some of the things that the scavenger does that aren't to do with the page sharing pool, like stopping baseline allocators, stopping utility heap allocators, stopping TLC allocators, flushing TLC deallocation logs, decommitting unused parts of TLCs, and decommitting expendable memory.

Deferred Decommit Log

Libpas's decommit algorithm coalesces madvise calls -- so if two adjacent pages, even from totally unrelated heaps, become empty, then their decommit will be part of one syscall. This is achieved by having places in the code that want to decommit memory instead add that memory to a deferred decommit log. This log internally uses a minheap based on address. The log also stores what lock was needed to decommit the range, since the decommit algorithm relies on fine-grained locking of memory rather than having a global commit lock. So, when the scavenger asks the page sharing pool to go find some memory to decommit, it gives it a deferred decommit log. The page sharing pool will usually return all of the empty pages in one go, so the deferred decommit logs can get somewhat big. They are allocated out of the bootstrap free heap (which isn't the best idea if they ever get very big, since bootstrap free heap memory is not decommitted -- but right now this is convenient because for a heap to support decommit, it needs to talk to deferred decommit logs, so we want to avoid infinite recursion). After the log is filled up, we can decommit everything in the log at once. This involves heapifying the array and then scanning it backwards while detecting adjacent ranges. This is the loop that actually calls decommit. A second loop unlocks the locks.
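Here is a minimal sketch of the coalescing idea -- not libpas's actual code. libpas keeps the log as a minheap keyed on address and scans it backwards; a plain qsort stands in for that here, and the lock bookkeeping is omitted:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

typedef struct {
    uintptr_t base;
    size_t size;
} decommit_range;

static int compare_ranges(const void* a, const void* b)
{
    const decommit_range* x = (const decommit_range*)a;
    const decommit_range* y = (const decommit_range*)b;
    if (x->base < y->base)
        return -1;
    return x->base > y->base;
}

static void flush_decommit_log(decommit_range* log, size_t count)
{
    size_t index;
    qsort(log, count, sizeof(decommit_range), compare_ranges);
    for (index = 0; index < count;) {
        uintptr_t base = log[index].base;
        uintptr_t end = base + log[index].size;
        /* Merge every logged range that abuts the current run, even if the
           ranges came from totally unrelated heaps. */
        while (++index < count && log[index].base == end)
            end += log[index].size;
        madvise((void*)base, end - base, MADV_DONTNEED); /* one syscall per run */
    }
}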

Two fun complications arise:

  • As the page sharing pool scans memory for empty pages, it may arrive at pages in a random order, which may be different from any valid order in which to acquire commit locks. So, after acquiring the first commit lock, all subsequent lock acquisitions are try_lock's. If any try_lock fails, the algorithm returns early, and the deferred decommit log helps facilitate detecting when a lock failed to be acquired.
  • Libpas supports calling into the algorithm even if some commit locks, or the heap lock, are held. It's not legal to try to acquire any commit locks other than by try_lock in that case. The deferred decommit log will also make sure that it will not relock any commit locks that are already held.

Libpas can be configured to return memory either by madvising it or by mmap-zero-filling it, which has a similar semantic effect but is slower. Libpas supports both symmetric and asymmetric forms of madvise, though the asymmetric form is faster and has a slight (though mostly theoretical) memory usage edge. By asymmetric I mean that you call some form of madvise to decommit memory and then do nothing to commit it. This works on Darwin and it's quite efficient -- the kernel will clear the decommit request and give the page real memory if you access the page. The memory usage edge of not explicitly committing memory in the memory allocator is that programs may allocate large arrays and never use the whole array. It's OK to configure libpas to use symmetric decommit, but the asymmetric variant might be faster or more efficient, if the target OS allows it.
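A hedged sketch of the two styles follows. MADV_FREE_REUSABLE and MADV_FREE_REUSE are the Darwin advice values; MADV_DONTNEED stands in as the portable decommit call. None of this is libpas's actual configuration API:

#include <stddef.h>
#include <sys/mman.h>

/* Decommit: tell the kernel the pages are empty. On Darwin this could be
   madvise with MADV_FREE_REUSABLE instead. */
static void decommit(void* base, size_t size)
{
    madvise(base, size, MADV_DONTNEED);
}

/* Symmetric commit: explicitly tell the kernel the range is in use again.
   In the asymmetric style this function would do nothing, because the first
   touch of a decommitted page recommits it. */
static void commit_symmetric(void* base, size_t size)
{
#ifdef MADV_FREE_REUSE
    madvise(base, size, MADV_FREE_REUSE);
#else
    (void)base;
    (void)size; /* on this OS, rely on first-touch recommit */
#endif
}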

Page Sharing Pool

Different kinds of heaps have different ways of discovering that they have empty pages. Libpas supports three different kinds of heaps (large, segregated, and bitfit) and one of those heaps has two different ways of discovering free memory (segregated shared page directories and segregated size directories). The page sharing pool is a data structure that can handle an arbitrary number of page sharing participants, each of which is able to say whether they have empty pages and whether those empty pages are old enough to be worth decommitting.

Page sharing participants need to be able to answer the following queries (see the sketch after this list):

  • Getting the epoch of the oldest free page. This can be approximate; for example, it's OK for the participant to occasionally give an epoch that is newer than the true oldest so long as this doesn't happen all the time. We call this the use_epoch.
  • Telling if the participant thinks it is eligible, i.e. has any free pages right now. It's OK for this to return true even if there aren't free pages. It's not OK for this to return false if there are free pages. If this returns true and there are no free pages, then it must return false after a bounded number of calls to take_least_recently_used. Usually if a participant incorrectly says that it is eligible, then this state will clear with exactly one call to take_least_recently_used.
  • Taking the least recently used empty page (pas_page_sharing_participant_take_least_recently_used). This is allowed to return that there aren't actually any empty pages. If there are free pages, this is allowed to return any number of them (doesn't have to be just one). Pages are "returned" via a deferred decommit log.
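The participant contract can be pictured as a table of function pointers, though in libpas the dispatch actually happens on the type tag stored alongside the participant pointer. The names below are illustrative stand-ins:

#include <stdbool.h>
#include <stdint.h>

typedef struct deferred_decommit_log deferred_decommit_log;

typedef struct {
    /* Approximate age of the oldest free page; may occasionally err toward
       newer, but not all the time. */
    uint64_t (*get_use_epoch)(void* participant);

    /* May spuriously say true; must never say false when free pages exist. */
    bool (*is_eligible)(void* participant);

    /* Pushes zero or more empty pages onto the log; returns whether it found
       any. Repeated calls must eventually clear spurious eligibility. */
    bool (*take_least_recently_used)(void* participant,
                                     deferred_decommit_log* log);
} page_sharing_participant_ops;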

Participants also notify the page sharing pool when they see a delta. Seeing a delta means one of:

  • The participant found a new free page and it previously thought that it didn't have any.
  • The participant has discovered that the oldest free page is older than previously thought.

It's OK to only report a delta if the participant knows that it was previously not advertising itself as being eligible, and it's also OK to report a delta every time that a free page is found. Most participants try to avoid reporting deltas unless they know that they were not previously eligible. However, some participants (the segregated and bitfit directories) are sloppy about reporting the epoch of the oldest free page. Those participants will conservatively report a delta anytime they think that their estimate of the oldest page's age has changed.

The page sharing pool itself comprises:

  • A segmented vector of page sharing participant pointers. Each pointer is tagged with the participant's type, which helps the pool decide how to get the use_epoch, decide whether the participant is eligible, and take free pages from it.
  • A bitvector of participants that have reported deltas.
  • A minheap of participants sorted by epoch of the oldest free memory.
  • The current_participant, which the page sharing pool will ask for pages before doing anything else, so long as there are no deltas and the current participant continues to be eligible and continues to report the same use epoch.

If the current participant is not set or does not meet the criteria, the heap lock is taken and all of the participants that have the delta bit set get reinserted into the minheap based on updated use epochs. It's possible that the delta bit causes the removal of entries in the minheap (if something stopped being eligible). It's possible that the delta bit causes the insertion of entries that weren't previously there (if something has just become eligible). And it's possible for an entry to be removed and then readded (if the use epoch changed). Then, the minimum of the minheap becomes the current participant, and we can ask it for pages.
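A compact sketch of the selection logic, with stand-in types; a linear scan over an array stands in for the real delta bitvector, segmented vector, and minheap:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool delta;         /* set when the participant reports a change */
    bool eligible;      /* has (or thinks it has) empty pages */
    uint64_t use_epoch; /* age of its oldest free page */
} participant;

static participant* select_participant(participant* current,
                                       participant* all, size_t count)
{
    size_t i;
    participant* best = NULL;
    bool any_delta = false;
    for (i = 0; i < count; ++i)
        any_delta |= all[i].delta;
    /* Keep the current participant while nothing changed and it remains
       eligible (a use-epoch change would arrive as a delta here). */
    if (current && current->eligible && !any_delta)
        return current;
    for (i = 0; i < count; ++i) {
        all[i].delta = false; /* deltas are consumed during reselection */
        if (all[i].eligible && (!best || all[i].use_epoch < best->use_epoch))
            best = &all[i];
    }
    return best; /* the oldest free memory wins */
}

In the real pool, the reselection branch is what requires the heap_lock, which is why keeping a valid current_participant matters for speed.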

The pool itself is a class but it happens to be a singleton right now. It's probably a good idea to keep it as a class, because as I've experimented with various approaches to organizing memory, I have had versions where there are many pools. There is always the physical page sharing pool (the singleton), but I once had sharing pools for moving pages between threads and sharing pools for trading virtual memory.

The page sharing pool exposes APIs for taking memory from the pool. There are two commonly used variants:

  • Take a set number of bytes. The page sharing pool will try to take that much memory unless a try_lock fails, in which case it records the shortfall and takes that many additional bytes on the next call.
  • Take all free pages that are same age or older than some max_epoch. This API is called pas_physical_page_sharing_pool_scavenge. This algorithm will only return when it is done. It will reloop in case of try_lock failure, and it does this in a way that avoids spinning.

The expected behavior is that when the page sharing pool has to get a bunch of pages it will usually get a run of them from some participant -- hence the emphasis on the current_participant. However, it's possible that various tweaks that I've made to the algorithm have made this no longer be the case. In that case, it might be worthwhile to try to come up with a way of recomputing the current_participant that doesn't require holding the heap_lock. Maybe even just a lock for the page sharing pool, rather than using the heap_lock, would be enough to get a speed-up, and in the best case that speed-up would make it easier to increase scavenger frequency.

The page sharing pool's scavenge API is the main part of what the scavenger does. The next sections describe the inner workings of the page sharing participants.

Large Sharing Pool

The large sharing pool is a singleton participant in the physical page sharing pool. It tracks which ranges of pages are empty across all of the large heaps. The idea is to allow large heaps to sometimes split a page among each other, so no single large heap knows on its own whether the page can be decommitted. So, all large heaps as well as the large utility free heap and compact large utility free heap report when they allocate and deallocate memory to the large sharing pool.

Internally, the large sharing pool maintains a red-black tree and a minheap. The tree tracks coalesced ranges of pages and their states. The data structure thinks it knows about all of memory, and it "boots up" with a single node representing the whole address space and that node claims to be allocated and committed. When heaps that talk to the large sharing pool acquire memory, they tell the sharing pool that the memory is now free. Any free-and-committed ranges also reside in the minheap, which is ordered by use epoch (the time when the memory became free).

The large sharing pool can only be used while holding the heap_lock, but it uses a separate lock, the pas_virtual_range_common_lock, for commit and decommit. So, while libpas is blocked in the madvise syscall, it doesn't have to hold the heap_lock and the other lock only gets acquired when committing or decommitting large memory.

It's natural for the large sharing pool to handle the participant API:

  • The large sharing pool participant says it's eligible when the minheap is non-empty.
  • The large sharing pool participant reports the minheap's minimum use epoch as its use epoch (though it can be configured to do something else; that something else may not be interesting anymore).
  • The large sharing pool participant takes least recently used by removing the minimum from the minheap and adding that node's memory range to the deferred decommit log.

The large sharing pool registers itself with the physical page sharing pool when some large heap reports the first bit of memory to it.

Segregated Directory Participant

Each segregated directory is a page sharing participant. They register themselves once they have pages that could become empty. Segregated directories use the empty bits and the last-empty-plus-one index to satisfy the page sharing pool participant API:

  • A directory participant says it's eligible when last-empty-plus-one is nonzero.
  • A directory participant reports the use epoch of the last empty page before where last-empty-plus-one points. Note that in extreme cases this means having to search the bitvector for a set empty bit, since the last-empty-plus-one could lag in being set to a lower value in case of races.
  • A directory participant takes least recently used by searching backwards from last-empty-plus-one and taking the last empty page. This action also updates last-empty-plus-one using the pas_versioned_field lock-free tricks.

Once an empty page is found, the basic idea is as follows (sketched in code after the list):

  1. Clear the empty bit.
  2. Try to take eligibility; i.e. make the page not eligible. We use the eligible bit as a kind of lock throughout the segregated heap; for example, pages will not be eligible if they are currently used by some local allocator. If this fails, we just return. Note that it's up to anyone who makes a page ineligible to then check if it should set the empty bit after they make it eligible again. So, it's fine for the scavenger to not set the empty bit again after clearing it and failing to take the eligible bit. This also prevents a spin where the scavenger keeps trying to look at this allegedly empty page even though it's not eligible. By clearing the empty bit and not setting it again in this case, the scavenger will avoid this page until it becomes eligible.
  3. Grab the ownership lock to say that the page is now decommitted.
  4. Grab the commit lock and then put the page on the deferred decommit log.
  5. Make the page eligible again.
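In code, the basic idea looks something like the sketch below; the helpers are hypothetical stand-ins for libpas's bit and lock operations, not its real names:

#include <stdbool.h>

typedef struct view view;
typedef struct deferred_decommit_log deferred_decommit_log;

/* Hypothetical helpers standing in for libpas's bit and lock operations. */
void view_clear_empty_bit(view* v);
bool view_try_take_eligibility(view* v);
void view_lock_ownership(view* v);
void view_note_decommitted(view* v);
void view_unlock_ownership(view* v);
void view_lock_commit(view* v);
void view_set_eligible_bit(view* v);
void log_append(deferred_decommit_log* log, view* v);

static void take_empty_page(view* v, deferred_decommit_log* log)
{
    view_clear_empty_bit(v);            /* step 1 */
    if (!view_try_take_eligibility(v))  /* step 2: bail; whoever holds */
        return;                         /* eligibility re-checks emptiness */
    view_lock_ownership(v);             /* step 3 */
    view_note_decommitted(v);
    view_unlock_ownership(v);
    view_lock_commit(v);                /* step 4: the commit lock is released
                                           when the decommit log flushes */
    log_append(log, v);
    view_set_eligible_bit(v);           /* step 5 */
}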

Sadly, it's a bit more complicated than that:

  • Directories don't track pages; they track views. Shared page directories have views of pages whose actual eligibility is covered by the eligible bits in the segregated size directories that hold the shared page's partial views. So, insofar as taking eligibility is part of the algorithm, shared page directories have to take eligibility for each partial view associated with the shared view.
  • Shared and exclusive views could both be in a state where they don't even have a page. So, before looking at anything about the view, it's necessary to take the ownership lock. In fact, to have a guarantee that nobody is messing with the page, we need to grab the ownership lock and take eligibility. The actual algorithm does these two things together.
  • It's possible for the page to not actually be empty even though the empty bit is set. We don't require that the empty bit is cleared when a page becomes nonempty.
  • Some of the logic of decommitting requires holding the page->lock_ptr, which may be a different lock than the ownership lock. So the algorithm actually takes the ownership lock, then the page lock, then the ownership lock again, and then the commit lock.
  • If a try_lock fails, then we set both the eligible and empty bits, since in that case, we really do want the page sharing pool to come back to us.

Some page configs support segregated pages that have multiple system pages inside them. In that case, the empty bit gets set when any system page becomes empty (using granule use counts), and the taking algorithm just decommits the granules rather than decommitting the whole page.

Bitfit Directory Participant

Bitfit directories also use an empty bitvector and also support granules. Although bitfit directories are an independent piece of code, their approach to participating in page sharing pools exactly mirrors what segregated directories do.

This concludes the discussion about the page sharing pool and its participants. Next, I will cover some of the other things that the scavenger does.

Stopping Baseline Allocators

The scavenger also routinely stops baseline allocators. Baseline allocators are easy for the scavenger to stop because they can be used by anyone who holds their lock; so, the scavenger can stop any baseline allocator it wants. It will only stop those allocators that haven't been used in a while (which it detects by checking and resetting a dirty bit).

Stopping Utility Heap Allocators

The scavenger also does the same thing for utility allocators. This just requires holding the heap_lock.

Stopping TLC Allocators

Allocators in TLCs also get stopped, but this requires more effort. When the scavenger thread is running, any thread could be using any of its allocators, and no locks are held when this happens. So, the scavenger uses the following algorithm:

  • First it tries to ask the thread to stop certain allocators. In each allocation slow path, the allocator checks if any of the other allocators in the TLC have been requested to stop by the scavenger, and if so, it stops those allocators. That doesn't require special synchronization because the thread that owns the allocator is the one stopping it.
  • If that doesn't work out, the scavenger suspends the thread and stops all allocators that don't have the in_use bit set. The in_use bit is set whenever a thread does anything to a local allocator, and cleared after.

One wrinkle about stopping allocators is that stopped allocators might get decommitted. The way that this is coordinated is that a stopped allocator is in a special state that means that any allocation attempt will take a slow path that acquires the TLC's scavenging lock and possibly recommits some pages and then puts the allocator back into a normal state.

Flushing TLC Deallocation Logs

The TLC deallocation logs can be flushed by any thread that holds the scavenging lock. So, the scavenger thread flushes all deallocation logs that haven't been flushed recently.

When a thread flushes its own log, it holds the scavenging lock. However, appending doesn't grab the lock. To make this work, when the scavenger flushes the log, it:

  • Replaces with zero any entry in the log that it deallocated. (Actually, the deallocation log flush always does this.)
  • Does not reset the thread_local_cache->deallocation_log_index. In fact, it doesn't do anything except read the field.

Because of the structure of the deallocation log flush, it's cheap for it to null-check everything it loads from the deallocation log. So when a thread goes to flush its log after the scavenger has done it, it will see a bunch of null entries, and it will skip them. If a thread tries to append to the deallocation log while the scavenger is flushing it, then this just works, because it ends up storing the new value above what the scavenger sees. The object is still in the log, and will get deallocated on the next flush.
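The interplay can be sketched like this, with hypothetical names and capacity; the real code also takes the scavenging lock around the loop:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEALLOCATION_LOG_CAPACITY 1000 /* hypothetical */

typedef struct {
    uintptr_t deallocation_log[DEALLOCATION_LOG_CAPACITY];
    size_t deallocation_log_index; /* next free slot; owned by the thread */
} thread_local_cache;

void deallocate(uintptr_t begin); /* stand-in for the real deallocator */

/* Both the owning thread and the scavenger run this loop; only the owning
   thread resets the index afterwards. */
static void flush_deallocation_log(thread_local_cache* cache, bool is_owner)
{
    size_t index;
    for (index = cache->deallocation_log_index; index--;) {
        uintptr_t entry = cache->deallocation_log[index];
        if (!entry)
            continue; /* the scavenger already flushed this entry; skip it */
        cache->deallocation_log[index] = 0; /* flushing always zeroes entries */
        deallocate(entry);
    }
    if (is_owner)
        cache->deallocation_log_index = 0;
}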

Decommitting Unused Parts of TLCs

The scavenger can decommit any page of a TLC that consists entirely of stopped allocators. Decommitted memory reads back as zero, and the zero value triggers a slow path in the local allocator and local view cache that then commits the page and rematerializes the allocator. This is painstakingly ensured; to keep this property you generally have to audit the fast paths to see which parts of allocators they access, and make sure that the allocator goes to a slow path if they are all zero. That slow path then has to check if the allocator is stopped or decommitted, and if it is either, it grabs the TLC's scavenging lock and recommits and rematerializes.

This feature is particularly valuable because of how big local allocators and local view caches are. When there are a lot of heaps, this accounts for tens of MBs in some cases. So, being able to decommit unused parts is a big deal.

Decommitting Expendable Memory

Segregated heaps maintain lookup tables that map "index", i.e. the object size divided by the heap's minalign, to allocator indices and directories. There are three tables (see the sketch after this list):

  • The index-to-allocator-index table. This only works for small-enough indices.
  • The index-to-directory table. This also only works for small-enough indices.
  • The medium index-to-directory-tuple binary search array. This works for the not-small-enough indices.
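A minimal sketch of the lookup, with stand-in names; the real tables store allocator indices as well, and the medium binary-search path is only stubbed out here:

#include <stddef.h>
#include <stdint.h>

typedef struct segregated_directory segregated_directory;

typedef struct {
    size_t minalign;
    size_t small_index_bound;
    uint8_t* index_to_allocator_index;         /* small-enough indices only */
    segregated_directory** index_to_directory; /* small-enough indices only */
    /* Medium indices use a binary search over (begin, end, directory)
       tuples; omitted here. */
} segregated_heap_tables;

static segregated_directory* directory_for_size(segregated_heap_tables* tables,
                                                size_t size)
{
    size_t index = (size + tables->minalign - 1) / tables->minalign;
    if (index < tables->small_index_bound)
        return tables->index_to_directory[index];
    return NULL; /* fall back to the medium binary-search array */
}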

It makes sense to let those tables get large. It might even make sense for each isoheap to have a large lookup table, though currently they use some smaller threshold. But for large tables not to cost a lot of memory, we need to be able to decommit memory that is only used by lookup tables of heaps that nobody is using.

So, libpas has this weird thing called pas_expendable_memory, which allows us to allocate objects of immortal memory that automatically get decommitted if we don't "touch" them frequently enough. The scavenger checks the state of all expendable memory, and decommits those pages that haven't been used in a while. The expendable memory algorithm is not wired up as a page sharing participant because the scavenger really needs to poke the whole table maintained by the algorithm every time it does a tick; otherwise the algorithm would not work. Fortunately, there's never a lot of this kind of memory, so this isn't a performance problem as far as I know. It also doesn't save that much memory right now, partly because isoheaps currently use the smaller-size tables.

This concludes the discussion of the scavenger. To summarize, the scavenger periodically:

  • Stops baseline allocators.
  • Stops utility heap allocators.
  • Stops TLC allocators.
  • Flushes TLC deallocation logs.
  • Decommits unused parts of TLCs.
  • Decommits expendable memory.
  • Asks the physical page sharing pool to scavenge.

While it does these things, it notes whether anything is left in a state where it would be worthwhile to run again. A steady state might mean, for example, that all empty pages have been decommitted rather than only the old-enough ones. If a steady state looks like it's being reached, the scavenger first sleeps for a while, and then shuts down entirely.

Megapages and Page Header Tables

Libpas provides non-large heaps with two fast and scalable ways of learning about an object from its address, as needed during deallocation or during user queries like malloc_size.

  1. Megapages and the megapage table. Libpas describes every 16MB of memory using a two-bit enum called pas_fast_megapage_kind. The zero state indicates that this address does not have a fast megapage. The other two describe two different kinds of fast megapages, one where you know exactly the type of page it is (small exclusive segregated) and one where you have to find out by asking the page header, but at least you know that it's small (so usually 16KB).
  2. Page header table. This is a lock-free-to-read hashtable that tells you where to find the page header for some page in memory. For pages that use page headers, we also use the page header table to find out if the memory address is in memory owned by some kind of "page" (either a segregated page or a bitfit page).

Usually, the small segregated and small bitfit configs use megapages. The medium segregated, medium bitfit, and marge bitfit configs use page header tables. Large heaps don't use either, since they have the large map.
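A sketch of the megapage lookup, assuming a flat table of two-bit entries indexed by the address divided by 16MB; libpas's actual table representation differs:

#include <stdint.h>

typedef enum {
    fast_megapage_kind_none = 0,
    fast_megapage_kind_small_exclusive_segregated = 1,
    fast_megapage_kind_small_other = 2 /* ask the page header for details */
} fast_megapage_kind;

#define MEGAPAGE_SHIFT 24 /* 16MB megapages */

static fast_megapage_kind megapage_kind(const uint8_t* table, uintptr_t addr)
{
    uintptr_t megapage_index = addr >> MEGAPAGE_SHIFT;
    /* Four two-bit entries per byte. */
    uint8_t byte = table[megapage_index >> 2];
    return (fast_megapage_kind)((byte >> ((megapage_index & 3) << 1)) & 3);
}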

The Enumerator

Libpas supports the libmalloc enumerator API. It even includes tests for it. The way that this is implemented revolves around the pas_enumerator class. Many of the data structures that are needed by the enumerator have APIs to support enumeration that take a pas_enumerator*. The idea behind how it works is conceptually easy: just walk the heap -- which is possible to do accurately in libpas -- and report the metadata areas, the areas that could contain objects, and the live objects. But, we have to do this while walking a heap in a foreign process, which we see only through little copies of it that we request a callback to make for us. The code for enumeration isn't all that pretty but at least it's easy to find most of it by looking for functions and files that refer to pas_enumerator.

But a challenge of enumeration is that it happens when the remote process is stopped at some arbitrary point. That point has to be an instruction boundary -- so CPU memory model issues aren't a problem. But this means that the whole algorithm has to ensure that at any instruction boundary, the enumerator will see the right thing. Logic to help the enumerator still understand the heap at any instruction is spread throughout the allocation algorithms. Even the trees and hashtables used by the large heap have special hacks to enable them to be enumerable at any instruction boundary.

The enumerator is maintained so that it's 100% accurate. Any discrepancy -- either in what objects are reported live or what their sizes are -- gets flagged as an error by test_pas. The goal should be to maintain perfect enumerator accuracy. This makes sense for two reasons:

  1. I've yet to see a case where doing so is a performance or memory regression.
  2. It makes the enumerator easy to test. I don't know how to test an enumerator that is not accurate by design. If it's accurate by design, then any discrepancy between the test's understanding of what is live and what the enumerator reports can be flagged as a test failure. If it wasn't accurate by design, then I don't know what it would mean for a test to fail or what the tests could even assert.

The Basic Configuration Template

Libpas's heap configs and page configs allow for a tremendous amount of flexibility. Things like the utility heap and the jit_heap leverage this flexibility to do strange things. However, if you're using libpas to create a normal malloc, then a lot of the configurability in the heap/page configs is too much. A "normal" malloc is one that is exposed as a normal API rather than being internal to libpas, and that manages memory that doesn't have special properties like that it's marked executable and not writeable.

The basic template is provided by pas_heap_config_utils.h. To define a new config based on this template, you need to:

  • Add the appropriate heap config and page config kinds to pas_heap_config_kind.def, pas_segregated_page_config_kind.def, and pas_bitfit_page_config_kind.def. You also have to do this if you add any kind of config, even one that doesn't use the template.
  • Create the files foo_heap_config.h and foo_heap_config.c. These are mostly boilerplate.

The header file usually looks like this:

#define ISO_MINALIGN_SHIFT ((size_t)4)
#define ISO_MINALIGN_SIZE ((size_t)1 << ISO_MINALIGN_SHIFT)

#define ISO_HEAP_CONFIG PAS_BASIC_HEAP_CONFIG( \
    iso, \
    .activate = pas_heap_config_utils_null_activate, \
    .get_type_size = pas_simple_type_as_heap_type_get_type_size, \
    .get_type_alignment = pas_simple_type_as_heap_type_get_type_alignment, \
    .dump_type = pas_simple_type_as_heap_type_dump, \
    .check_deallocation = true, \
    .small_segregated_min_align_shift = ISO_MINALIGN_SHIFT, \
    .small_segregated_sharing_shift = PAS_SMALL_SHARING_SHIFT, \
    .small_segregated_page_size = PAS_SMALL_PAGE_DEFAULT_SIZE, \
    .small_segregated_wasteage_handicap = PAS_SMALL_PAGE_HANDICAP, \
    .small_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_oblivious_logging_mode, \
    .small_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \
    .small_exclusive_segregated_enable_empty_word_eligibility_optimization = false, \
    .small_shared_segregated_enable_empty_word_eligibility_optimization = false, \
    .small_segregated_use_reversed_current_word = PAS_ARM64, \
    .enable_view_cache = false, \
    .use_small_bitfit = true, \
    .small_bitfit_min_align_shift = ISO_MINALIGN_SHIFT, \
    .small_bitfit_page_size = PAS_SMALL_BITFIT_PAGE_DEFAULT_SIZE, \
    .medium_page_size = PAS_MEDIUM_PAGE_DEFAULT_SIZE, \
    .granule_size = PAS_GRANULE_DEFAULT_SIZE, \
    .use_medium_segregated = true, \
    .medium_segregated_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \
    .medium_segregated_sharing_shift = PAS_MEDIUM_SHARING_SHIFT, \
    .medium_segregated_wasteage_handicap = PAS_MEDIUM_PAGE_HANDICAP, \
    .medium_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_aware_logging_mode, \
    .medium_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \
    .use_medium_bitfit = true, \
    .medium_bitfit_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \
    .use_marge_bitfit = true, \
    .marge_bitfit_min_align_shift = PAS_MIN_MARGE_ALIGN_SHIFT, \
    .marge_bitfit_page_size = PAS_MARGE_PAGE_DEFAULT_SIZE, \
    .pgm_enabled = false)

PAS_API extern pas_heap_config iso_heap_config;

PAS_BASIC_HEAP_CONFIG_DECLARATIONS(iso, ISO);

Note the use of PAS_BASIC_HEAP_CONFIG, which creates a config literal that automatically fills in a bunch of heap config, segregated page config, and bitfit page config fields based on the arguments you pass to PAS_BASIC_HEAP_CONFIG. The corresponding .c file looks like this:

pas_heap_config iso_heap_config = ISO_HEAP_CONFIG;

PAS_BASIC_HEAP_CONFIG_DEFINITIONS(
    iso, ISO,
    .allocate_page_should_zero = false,
    .intrinsic_view_cache_capacity = pas_heap_runtime_config_zero_view_cache_capacity);

Note that this just configures whether new pages are zeroed and what the view cache capacity for the intrinsic heap is. The intrinsic heap is one of the four categories of heaps that the basic heap configuration template supports:

  • Intrinsic heaps are global singleton heaps, like the common heap for primitives. WebKit's fastMalloc bottoms out in an intrinsic heap.
  • Primitive heaps are heaps for primitive untyped values, but that aren't singletons. You can have many primitive heaps.
  • Typed heaps have a type, and the type has a fixed size and alignment. Typed heaps allow allocating single instances of objects of that type or arrays of that type.
  • Flex heaps are for objects with flexible array members. They pretend as if their type has size and alignment equal to 1, but in practice they are used for objects that have some base size plus a variable-length array. Note that libpas doesn't correctly manage flex memory in the large heap; we need a variant of the large heap that knows that you cannot reuse flex memory between different sizes.

The basic heap config template sets up some basic defaults for how heaps work:

  • It makes small segregated and small bitfit page configs put the page header at the beginning of the page and it arranges to have those pages allocated out of megapages.
  • It makes medium segregated, medium bitfit, and marge bitfit use page header tables.
  • It sets up a way to find things like the page header tables from the enumerator.
  • It sets up segregated shared page directories for each of the segregated page configs.

The bmalloc_heap_config is an example of a configuration that uses the basic template. If we ever wanted to put libpas into some other malloc library, we'd probably create a heap config for that library, and we would probably base it on the basic heap config template (though we don't absolutely have to).

JIT Heap Config

The JIT heap config is for replacing the MetaAllocator as a way of doing executable memory allocation in WebKit. It needs to satisfy two requirements of executable memory allocation:

  • The allocator cannot read or write the memory it manages, since that memory may have weird permissions at any time.
  • Clients of the executable allocator must be able to in-place shrink allocations.

The large heap trivially supports both requirements. The bitfit heap trivially supports the second requirement, and can be made to support the first requirement if we use page header tables for all kinds of memory, not just medium or marge. So, the JIT heap config focuses on just using bitfit and large and it forces bitfit to use page header tables even for the small bitfit page config.
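A hedged usage sketch follows; the function names (jit_heap_add_fresh_memory, jit_heap_try_allocate, jit_heap_shrink, jit_heap_deallocate) match my reading of jit_heap.h, but treat the exact signatures as assumptions, and the pool address and sizes here are made up:

#include "jit_heap.h"

#include <stddef.h>
#include <stdint.h>

static void jit_heap_example(uintptr_t pool_base, size_t pool_size)
{
    void* code;

    /* The client reserves the executable memory and donates it to the heap. */
    jit_heap_add_fresh_memory(pas_range_create(pool_base, pool_base + pool_size));

    code = jit_heap_try_allocate(1024);
    if (!code)
        return;

    /* Executable allocator clients can shrink in place, e.g. after branch
       compaction reveals the code is smaller than the worst-case estimate. */
    jit_heap_shrink(code, 512);

    jit_heap_deallocate(code);
}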

Security Considerations

Probabilistic Guard Malloc

Probabilistic Guard Malloc (PGM) is a new allocator designed to catch use-after-free attempts and out-of-bounds accesses. It behaves similarly to AddressSanitizer (ASAN), but aims to have minimal runtime overhead.

The design of PGM is quite simple. Each time an allocation is performed, an additional guard page is added above and below the newly allocated page(s). An allocation may span multiple pages. When a deallocation is performed, the page(s) allocated will be protected using mprotect to ensure that any use-after-free will trigger a crash. Virtual memory addresses are never reused, so we will never run into a case where object 1 is freed, object 2 is allocated over the same address space, and a stale pointer to object 1 then accesses the memory of object 2.

PGM does add notable memory overhead. Each allocation, no matter the size, adds an additional 2 guard pages (8KB for X86_64 and 32KB for ARM64). In addition, there may be free memory left over in the page(s) allocated for the user; that memory cannot be used by any other allocation.

We added limits on virtual memory and wasted memory to help limit the memory impact on the overall system. Virtual memory for this allocator is limited to 1GB. Wasted memory, which is the unused memory in the page(s) allocated by the user, is limited to 1MB. These overall limits should ensure that the memory impact on the system is minimal, while helping to tackle the problems of catching use after frees and out of bounds accesses.
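The core idea can be sketched in a few lines -- this is an illustration of the technique, not libpas's PGM code, and it ignores the virtual-memory and wasted-memory accounting:

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void* pgm_allocate(size_t size)
{
    size_t page = (size_t)getpagesize();
    size_t payload = (size + page - 1) & ~(page - 1);
    /* Reserve guard-payload-guard as one PROT_NONE mapping... */
    char* base = mmap(NULL, payload + 2 * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* ...then open up only the payload pages between the two guards. */
    if (mprotect(base + page, payload, PROT_READ | PROT_WRITE))
        return NULL;
    return base + page;
}

static void pgm_deallocate(void* object, size_t size)
{
    size_t page = (size_t)getpagesize();
    size_t payload = (size + page - 1) & ~(page - 1);
    /* Trap any use-after-free; the address range is never reused. */
    mprotect(object, payload, PROT_NONE);
}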

The Fast Paths

All of the discussion in the previous sections is about the innards of libpas. But ultimately, clients want to just call malloc-like and free-like functions to manage memory. Libpas provides fast path templates that actual heap implementations reuse to provide malloc/free functions. The fast paths are:

  • pas_try_allocate.h, which is the single object allocation fast path for isoheaps. This function just takes a heap and no size; it allocates one object of the size and alignment that the heap's type wants.
  • pas_try_allocate_array.h, which is the array and aligned allocation fast path for isoheaps. You want to use it with heaps that have a type, and that type has a size and alignment, and you want to allocate arrays of that type or instances of that type with special alignment.
  • pas_try_allocate_primitive.h, which is the primitive object allocation fast path for heaps that don't have a type (i.e. they have the primitive type as their type -- the type says it has size and alignment equal to 1).
  • pas_try_allocate_intrinsic.h, which is the intrinsic heap allocation fast path.
  • pas_try_reallocate.h, which provides variants of all of the allocators that reallocate memory.
  • pas_deallocate.h, which provides the fast path for free.
  • pas_get_allocation_size.h, which is the fast path for malloc_size.

One thing to remember when dealing with the fast paths is that they are engineered so that malloc/free functions need no stack frame, no callee saves, and no saving of LR/FP to the stack. To facilitate this, we have the fast path call an inline-only fast path, and if that fails, we call a "casual case". The inline-only fast path makes no out-of-line function calls, since if it did, we'd need a stack frame. The only slow call (to the casual case) is a tail call. For example:

static PAS_ALWAYS_INLINE void* bmalloc_try_allocate_inline(size_t size)
{
    pas_allocation_result result;
    result = bmalloc_try_allocate_impl_inline_only(size, 1);
    if (PAS_LIKELY(result.did_succeed))
        return (void*)result.begin;
    return bmalloc_try_allocate_casual(size);
}

The way that the bmalloc_try_allocate_impl_inline_only and bmalloc_try_allocate_casual functions are created is with:

PAS_CREATE_TRY_ALLOCATE_INTRINSIC(
    bmalloc_try_allocate_impl,
    BMALLOC_HEAP_CONFIG,
    &bmalloc_intrinsic_runtime_config.base,
    &bmalloc_allocator_counts,
    pas_allocation_result_identity,
    &bmalloc_common_primitive_heap,
    &bmalloc_common_primitive_heap_support,
    pas_intrinsic_heap_is_designated);

All allocation fast paths require this kind of macro that creates a bunch of functions -- both the inline paths and the out-of-line paths. The deallocation, reallocation, and other fast paths are simpler. For example, deallocation is just:

static PAS_ALWAYS_INLINE void bmalloc_deallocate_inline(void* ptr)
{
    pas_deallocate(ptr, BMALLOC_HEAP_CONFIG);
}

If you look at pas_deallocate, you'll see cleverness that ensures that the slow path call is a tail call, similarly to how allocators work. However, for deallocation, I haven't had the need to make the slow call explicit in the client side (the way that bmalloc_try_allocate_inline has to explicitly call the slow path).

This concludes the discussion of libpas design.

Testing Libpas

I've tried to write test cases for every behavior in libpas, to the point that you should feel comfortable dropping a new libpas in WebKit (or wherever) if test_pas passes.

test_pas is a white-box component, regression, and unit test suite. It's allowed to call any libpas function, even internal functions, and sometimes functions that libpas exposes only for the test suite.

Libpas testing errs on the side of being comprehensive even if this creates annoying situations. Many tests assert detailed things about how many objects fit in a page, what an object's offset is in a page, and things like that. This means that some behavior changes in libpas that aren't in any way wrong will set off errors in the test suite. So, it's common to have to rebase tests when making libpas changes.

The libpas test suite is written in C++ and uses its own test harness that forks for each test, so each test runs in a totally pristine state. Also, the test suite can use malloc and new as much as it likes, since in the test suite, libpas does not replace malloc and free.

The most important libpas tests are the so-called chaos tests, which randomly create and destroy objects and assert that the heap's state is still sane (like that no live objects overlap, that all live objects are enumerable, etc).

Conclusion

Libpas is a beast of a malloc, designed for speed, memory efficiency, and type safety. May whoever maintains it find some joy in this insane codebase!