Recently, I came across this kernelCTF submission where the author mentions a novel technique for extending race windows in the Linux kernel:
I learned afterwards that this isn't a novel technique, as it was covered in this blog post by Starlabs. This blog post also explains a bit more than the kernelCTF submission about how the technique can be used.
However, neither of these two resources was enough for me. I wanted to really understand how this technique worked... So that's what I ended up doing!
This is going to be a short and simple blog post (mostly for myself) that covers what I learned.
PoC and TL;DR
Here is a minimal PoC I wrote to showcase this technique on Ubuntu 25.04 (kernel 6.14.11). Comment out the second fallocate() call to see the difference in access times.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <linux/falloc.h>
#include <err.h>
#include <pthread.h>
#define SYSCHK(x) ({ \
typeof(x) __res = (x); \
if (__res == (typeof(x))-1) \
err(1, "SYSCHK(" #x ")"); \
__res; \
})
#define FALLOC_LEN (64 * 1024 * 1024) // Max on Ubuntu 25.04
pthread_barrier_t barrier;
int ffd;
char *fmap;
void hole_punch() {
SYSCHK(ffd = open("/dev/shm/", O_TMPFILE | O_RDWR, 0666));
SYSCHK(fallocate(ffd, 0, 0, FALLOC_LEN));
SYSCHK(fmap = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, ffd, 0));
pthread_barrier_wait(&barrier); // barrier 1
SYSCHK(fallocate(ffd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, FALLOC_LEN));
}
int main(void) {
pthread_barrier_init(&barrier, NULL, 2);
pthread_t hole_punch_thread;
SYSCHK(pthread_create(&hole_punch_thread, NULL, (void*)hole_punch, NULL));
struct timespec start, end;
pthread_barrier_wait(&barrier); // barrier 1
clock_gettime(CLOCK_MONOTONIC, &start);
volatile char dummy = *(char *)fmap; // Simulated kernel access during hole punch
clock_gettime(CLOCK_MONOTONIC, &end);
long long stall_ns = (end.tv_sec - start.tv_sec) * 1000000000LL + (end.tv_nsec - start.tv_nsec);
printf("Stall time: %lld ns\n", stall_ns);
pthread_join(hole_punch_thread, NULL);
}

The short explanation:
1. Opening /dev/shm and passing the FD to fallocate() calls shmem_fallocate().
2. The FALLOC_FL_PUNCH_HOLE flag causes shmem_fallocate() to unmap the address range allocated by the first fallocate() call, as well as deallocate the actual physical pages back to the system. This takes a not-insignificant amount of time.
3. If a page fault occurs concurrently during this unmap and deallocation process (for example, when fmap is accessed in main()), shmem_fault() is called.
4. If shmem_fault() notices that fallocate() is in the middle of a hole punch, it calls shmem_falloc_wait(), which sleeps and waits for the hole punching to finish.
The key is in steps 3 and 4. If the kernel accesses this mapping and triggers shmem_fault() while a hole punch is in progress, it will be forced to sleep. The amount of time it sleeps is significantly longer than the time taken by any typical page fault, so a race window that contains such a page fault can be extended considerably.
Introducing /dev/shm
In Linux, /dev/shm is a RAM-based filesystem (a tmpfs mount) that can be used for low-latency temporary file storage. It is also commonly used for IPC through shared memory.
In the context of this blog post, you can treat it like any other file on the system. You can open /dev/shm, and map pages from it into your process for reading and writing. The key difference is that page faults and certain system calls are handled by handlers specific to /dev/shm.
Allocating Pages Using fallocate()
Opening /dev/shm and mapping a page into memory might look like this:
int ffd = open("/dev/shm/", O_TMPFILE | O_RDWR, 0666);
void *fmap = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, ffd, 0);

At this point, a physical page has not been allocated yet. The allocation will only happen if you attempt to access fmap:
// This triggers a page fault, which will allocate a physical page
// for this mapping.
*(int *)fmap = 0;

For /dev/shm, shmem_fault() will be called to handle this page fault, which calls shmem_get_folio_gfp(). The comment above shmem_get_folio_gfp() tells us all we need to know:
/*
 * shmem_get_folio_gfp - find page in cache, or get from swap, or allocate
 */

There are scenarios where the page fault can fail. For example, if the system is out of disk space, the page fault will trigger an OOM error when it fails to allocate a page.
I studied the code in do_user_addr_fault() that handles this error, and it seems like such an OOM error cannot be handled from userland at all.
To deal with such situations, the fallocate() system call can be used to immediately allocate physical pages for the opened file. If fallocate() cannot satisfy the request (for example, there is not enough disk space), it will return an error to the user and set errno, which allows the error to be handled much more gracefully from userland.
Note that for /dev/shm specifically, shmem_fallocate() is used to handle the fallocate() system call.
Deallocating Pages Using FALLOC_FL_PUNCH_HOLE
The fallocate() system call can also be used to deallocate the physical pages that were allocated. This is done by passing in the FALLOC_FL_PUNCH_HOLE flag, which also requires the FALLOC_FL_KEEP_SIZE flag to be set.
Let's take a look at what deallocation does in shmem_fallocate():
static long shmem_fallocate(struct file *file, int mode, loff_t offset,
loff_t len)
{
// [ ... ]
if (mode & FALLOC_FL_PUNCH_HOLE) {
// [ ... ]
inode->i_private = &shmem_falloc; // [ 1 ]
spin_unlock(&inode->i_lock);
if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start, // [ 2 ]
1 + unmap_end - unmap_start, 0);
shmem_truncate_range(inode, offset, offset + len - 1); // [ 3 ]
/* No need to unmap again: hole-punching leaves COWed pages */
spin_lock(&inode->i_lock);
inode->i_private = NULL;
wake_up_all(&shmem_falloc_waitq); // [ 4 ]
WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.head));
spin_unlock(&inode->i_lock);
error = 0;
goto out;
}
// [ ... ]
}

I've inserted annotations into the code:
- [ 1 ] - inode->i_private = &shmem_falloc is set. This signifies to other threads that a hole punch is in progress.
- [ 2 ] - unmap_mapping_range() is called to remove all associated page table entries within the memory range provided. This effectively unmaps the pages from all processes.
- [ 3 ] - shmem_truncate_range() is called to deallocate and free the allocated physical pages back to the page allocator.
- [ 4 ] - wake_up_all() wakes up any other threads that are waiting for the hole punch to finish.
Once the pages have been deallocated, any subsequent access to the mapping triggers a page fault and is handled the old-fashioned way.
Blocking Other Threads
As mentioned above, there are scenarios where other threads have to wait for the hole punching to finish before they can continue with whatever it is they were doing. One such scenario happens in shmem_fault()!
static vm_fault_t shmem_fault(struct vm_fault *vmf)
{
// [ ... ]
/*
* Trinity finds that probing a hole which tmpfs is punching can
* prevent the hole-punch from ever completing: noted in i_private.
*/
if (unlikely(inode->i_private)) { // [ 1 ]
ret = shmem_falloc_wait(vmf, inode);
if (ret)
return ret;
}
// [ ... ]
}

Remember how shmem_fallocate() sets inode->i_private to signify that a hole punch is in progress? We can see that shmem_fault() handles this exact situation at [ 1 ] by calling shmem_falloc_wait(). This puts the faulting thread to sleep until it is woken up by shmem_fallocate() after the hole punch has finished.
Extending Race Windows
So far, we've learned the following:
- shmem_fault() is called to handle page faults for pages mapped via /dev/shm.
- If physical pages are allocated via fallocate(), they can also be deallocated via fallocate() using FALLOC_FL_PUNCH_HOLE.
- If a page fault occurs while page deallocation (via fallocate()) is in progress, shmem_fault() will sleep.
We can make use of this primitive to extend certain race windows in the Linux kernel. The only pre-condition is that there must be an access to a userspace address inside the race window. This may happen, for example, if the kernel calls copy_from_user() or copy_to_user() to read from or write to a userspace address.
So, if you find a race condition vulnerability, and you notice that the race window meets this pre-condition, you can do the following to extend it:
1. Open /dev/shm and fallocate() the largest amount of memory you're able to (on Ubuntu 25.04, this is 64 MB).
2. On thread T1, trigger the kernel code path that enters the race window.
3. On thread T2, call fallocate() with the FALLOC_FL_PUNCH_HOLE flag to deallocate the memory you allocated in step 1.
If step 3 triggers right before the call to copy_to_user() / copy_from_user() inside the race window, T1 will go to sleep and wait for T2 to finish. At this point, the race window is extended, and you will have a much higher chance of triggering the vulnerability (which you may or may not have to do from a separate thread).
Conclusion
The PoC I showed in the PoC and TL;DR section can be adapted for any Linux kernel exploit. Simply replace the access to fmap in the main() function with whatever causes the kernel to enter the vulnerable race window.
If you have any questions, feel free to DM me on X!