Running log of non-obvious things about this codebase for future humans and AI agents working in this repo. Append entries; don't rewrite history. Keep it factual. Read this before diving in.
```shell
# The core integration suite — probe + torture are now Go tests in
# ublk/integration_*_test.go, invoked by make test-integration.
make test-integration

# Standalone harnesses (still example/* binaries for now):
make chain       # sudo needed; stacks two ublks (proxy -> storage)
make stress      # sudo needed; race-detector stress (create/close churn, IO-while-close, etc.)
make fault       # sudo needed; Backend returns EIO; verifies errors propagate to userspace
make sigkill     # sudo needed; child process killed mid-I/O; verifies kernel cleanup
make flushbench  # sudo needed; microsecond trace of backend calls during flush operations

# Torture long soak:
UBLK_TORTURE_DURATION=30m go test -tags=integration -run TestTortureRandomIO ...
```

The probe (example/probe/main.go) exercises both sides of the stack:
- Device-level (direct I/O, bypasses page cache): `BLKGETSIZE64` size check; pre-mkfs zero-read; random-block write/read roundtrip that also verifies the backend's raw storage holds the same bytes at the same offset (proves the kernel ↔ userspace offset mapping is 1:1).
- Filesystem-level: `mkfs.ext4` → mount → scripted write + `sync -f` (asserts backend writes > 0) → `fsync` alone (also asserts backend writes > 0) → drop caches + readback (asserts backend reads > 0) → scan the backend for the magic pattern (proves filesystem reads ultimately come from our in-memory storage) → concurrent writers → remount (journal replay) → umount → close → verify `/dev/ublkbN` gone.
If a step hangs beyond the timeout the probe panics, which prints a
full goroutine dump from the Go runtime — this is the single most useful
artifact when diagnosing ublk-level stalls, because it tells you whether
the worker is blocked in WaitCQE, inside Backend.*, or elsewhere.
make chain (example/chain/main.go) creates two ublk devices in the
same process: a storage ublk with an in-memory backend, then a proxy
ublk whose Backend forwards Pread/Pwrite calls to the storage's
block device (opened O_DIRECT). I/O written to the proxy's block
device must appear byte-for-byte at the same offset in the storage's
in-memory backend. This validates two complete ublk stacks running
side-by-side, two LockOSThread'd workers, and cross-device data
integrity. If this test passes, composition works.
make torture (example/torture/main.go) is the fuzz-style integrity
test. Each of N worker goroutines owns a disjoint region of the device;
each picks a random (offset, length) inside its region and a random
direction (read or write); on write it updates an in-memory shadow of
what the device should contain; on read it compares the result against
the shadow and fails the run (non-zero exit, with first-differing byte
offset) on any mismatch. Periodic fsync and full-region reverify runs
exercise the write-through and journaling paths. Run for minutes, not
seconds, to find subtle ordering bugs.
make fault (example/fault/main.go) injects backend errors at a
configurable rate and checks they propagate all the way up to
Pwrite/Pread on /dev/ublkbN. The scenarios cover low-rate
failures (10%), total write/read failure (100%), and the
often-forgotten "Close() with pending errors" case — which must not
hang.
make sigkill (example/sigkill/main.go) spawns a child process,
kills it with SIGKILL mid-I/O, and verifies the kernel's own cleanup
path (ublk_ch_release on fd close) is sufficient to remove the device
nodes and free whatever state the kernel holds. The parent then
creates a fresh device to confirm no leak. Matters because SIGKILL
bypasses every Go-level cleanup (defer, sync.Once, etc.) — the kernel
is the only thing protecting us.
make stress (example/stress/main.go) runs four stressors against
-race-instrumented library code:
- churn — tight `New` → small-I/O → `Close` loop; catches leaks and shutdown-order races.
- ioWhileClose — I/O goroutines hammer the block device; `Close()` mid-stream. Catches races between worker cleanup and in-flight I/O.
- concurrentClose — N goroutines call `Close()` at once. Confirms the `sync.Once` guard is sufficient.
- many — N devices alive simultaneously with writer goroutines, closed in parallel. Catches cross-device state bleed.
Any race-detector warning fails the run (non-zero exit). Run for
longer (-duration 5m) before a release or after touching shutdown
code.
Other diagnostic commands:
```shell
pgrep -af 'example/probe' | awk '{print $1}' | xargs -r sudo kill -SIGQUIT  # manual stack dump
cat /sys/class/block/ublkb*/stat  # block stats
sudo dmesg | tail -40             # kernel messages (ublk_drv logs here)
```

- `devInfo.DevID` must match `ctrlCmd.DevID` (kernel 6.17+ validation). We set both to `^uint32(0)` to request auto-assign. Previous code only set it in the ctrlCmd, which started returning `EINVAL` on 6.17.
- `ADD_DEV` has two encodings. The ioctl-encoded command (`uCmdAddDev`) is newer; the legacy `cmdAddDev` is tried as fallback. Expect `ENOTSUP` from the legacy path on modern kernels — that's normal, it just means the first path succeeded.
- `FETCH_REQ` is processed as deferred task work starting around 6.13. It only completes when the io_uring is entered with `IORING_ENTER_GETEVENTS`. That is why `worker.run()` submits via `SubmitAndWait()` (which passes that flag), not `Submit()`. Using plain `Submit()` leaves `START_DEV` hanging in the kernel waiting for the fetch to complete.
- Control ring uses `SQE128` (for `URING_CMD` passthrough of `ublksrv_ctrl_cmd`, which sits in `sqe->cmd` at offset 48). Data ring uses `SQE64` and packs the 16-byte `ublksrv_io_cmd` into the trailing `Cmd` field.
- Each worker must call `runtime.LockOSThread()` before its first `io_uring_enter`. ublk binds IO credentials to the thread that first submitted `FETCH_REQ`. If a goroutine gets migrated between threads, subsequent submissions fail or go to the wrong queue.
- `FETCH_REQ` must be submitted before `START_DEV` is issued (the kernel blocks `START_DEV` until the fetches arrive). The worker signals readiness through a channel after its first `SubmitAndWait()` so the main goroutine can proceed to `START_DEV`. See the comments in `worker.run()`.
The library's Device.Close issues UBLK_CMD_DEL_DEV, which is
fundamentally del_gendisk() inside the kernel. del_gendisk() blocks
until all open fds to /dev/ublkbN are released — this is standard
block-device teardown semantics, not a ublk quirk.
So: if a user opens /dev/ublkbN for their own I/O, they must
unix.Close(fd) before calling dev.Close(). Otherwise dev.Close
hangs indefinitely waiting for del_gendisk.
This almost took us out twice — once as "sync got stuck during fsdemo",
once as "ioWhileClose hangs". The fix in our test harnesses is to close
user fds first, then call Device.Close. For users of the library,
this needs to be documented clearly (README API section is a good
place; not yet done).
If someone wants a "force close even with open fds" behaviour in the
library, the options are: (a) have the library track user fds — a
major API change and leaky abstraction; (b) switch to
UBLK_CMD_DEL_DEV_ASYNC (kernel 6.1+), which marks the device for
deletion but returns immediately — /dev/ublkbN disappears later
when the refcount drops. The async variant is a better default but
changes Close's semantics (it no longer guarantees the node is gone
on return). Not implemented; worth considering if users hit this.
Clean Close always works. What leaves orphan device nodes is
ungraceful termination where Device.Close() is never called:
- Ctrl+C on a harness that doesn't trap SIGINT: Go's default handler exits without running defers. The long-running harnesses (`stress`, `torture`, `flushbench`) now trap SIGINT/SIGTERM and return cleanly, so `defer dev.Close()` runs. Short-running ones (`probe`, `chain`, `fault`) don't need it — they're too fast to matter in practice.
- SIGKILL or crash: deliberately tested by `make sigkill`. The kernel must clean up on its own via `ublk_ch_release` on fd close. On kernel 6.17.0 that path can wedge processes in D state, leaving orphan nodes until reboot. Unknown whether a specific upstream fix exists; ublk was heavily refactored Sept 2025 and 6.18 stable has multiple fixes. TODO.md tracks.
The library-correctness question — "can we keep using the API after an ungraceful kill?" — is distinct from the orphan-node problem, and it is answerable: make sigkill verifies that `ublk.New` in the parent succeeds after the child is SIGKILL'd (observed: 11 ms on kernel 6.17.0). The kernel
allocates new minor numbers independently of whether old ones have
been reclaimed, until the ublks_max limit (default 64) is hit.
If you accumulate stale nodes, reboot — no userspace cleanup works reliably.
```shell
# how many ublk minors are consumed vs the kernel's limit
ls /sys/class/ublk-char/ | wc -l
cat /sys/module/ublk_drv/parameters/ublks_max
# module refcount (grows over a session, never shrinks if devices leak)
lsmod | awk '$1 == "ublk_drv"'
# find the stuck processes — they'll be in D (uninterruptible) state
ps -eo pid,state,cmd | awk '$2 == "D"' | grep -i ublk
# what kernel routine is a stuck process waiting on
sudo cat /proc/<PID>/stack
cat /proc/<PID>/wchan
# recent ublk messages
journalctl -k --since '5 minutes ago' | grep -i ublk
```

If `/proc/<PID>/stack` shows frames inside `ublk_ch_release` or `blk_mq_quiesce_queue` or similar, that's the smoking gun for the kernel-side hang. Take the output, report it to the linux-block mailing list (and cc: Ming Lei — ming.lei@redhat.com, the driver maintainer), and reboot the machine.
See TODO.md for a planned upstream repro.
Ring.Cancel() uses an eventfd/epoll wakeup to break a blocked
WaitCQE. But WaitCQE has a fast-path that returns an already-queued
CQE without ever calling epoll_wait — and under sustained I/O
pressure the CQ is always non-empty when the worker re-enters
WaitCQE. Without an additional cancel-flag check the worker never
observes the cancel signal and Device.shutdown hangs forever.
Fix (current): Ring.cancelled (atomic.Bool) set by Cancel(),
checked at the top of every WaitCQE iteration. The eventfd+epoll
setup stays — it handles the case where the CQ is empty and WaitCQE
is blocked in epoll_wait. The regression test is
TestCancelObservedWithCQEReady in ublk/uring/uring_test.go; do not
remove it if you refactor WaitCQE.
Device.shutdown() ordering matters:
1. `w.ioRing.Cancel()` on each worker (eventfd wake of blocked `WaitCQE`). This is a main-goroutine operation, safe because the worker hasn't closed the ring yet.
2. `wg.Wait()` — happens-before barrier; workers have exited `run()` and will not touch any shared state thereafter.
3. For each worker: `w.cleanup()` to munmap `ioDescs` and `Close()` the ring. Done from the main goroutine, so ring state writes don't race with reads in `Cancel()`.
4. `close(charFD)` — triggers `ublk_ch_release` in the kernel, aborting any stale ublk_io state so `delDev()` won't block on in-flight IOs.
5. `stopDev()`, `delDev()`.
6. Close the control ring, close `ctrlFD`.
The old version interleaved these steps and the race detector flagged conflicts between `worker.cleanup` / `Device.shutdown` / `Ring.Cancel` / `Ring.Close`. Do not refactor the order without rerunning `make test-integration` under `-race` — the kernel doesn't enforce this ordering and the bugs are stochastic.
- Integration tests live in `ublk/ublk_integration_test.go` behind `//go:build integration`. The file's `TestMain` hard-fails (not skips) if not run as root or if `ublk_drv` is missing. Don't reintroduce `t.Skip` for these — the user explicitly wants failure, not silence.
- golangci-lint must know about the tag or it flags `memBackend.snapshot` as unused. Set via `run.build-tags: [integration]` in `.golangci.yml`.
- gopls / editors don't read `.golangci.yml`. The portable fix is `go env -w GOFLAGS=-tags=integration`. For VSCode specifically, `.vscode/settings.json` has `gopls.build.buildFlags` — but `.vscode/` is .gitignored, so don't rely on committing it.
make cover produces coverage/unit.out + coverage/integration.out.
make cover-html opens the integration profile in a browser.
CI splits coverage collection across three jobs:
- `test-unit` (amd64+arm64): runs unit tests, uploads `unit.out` and `unit.html` as the `coverage-unit` artifact.
- `test-integration` (amd64+arm64): runs integration tests, uploads `integration.out` and `integration.html` as `coverage-integration`.
- `coverage` (amd64 only, `needs: [test-unit, test-integration]`): downloads both artifacts, merges them with `gocovmerge`, uploads `combined.out`/`combined.html` as `coverage-combined`.

Every run page therefore has three separate artifact bundles plus a `## Combined coverage` block in the step summary showing the merged number.
On pushes to main, the coverage job generates a Shields.io
endpoint JSON file (badge/coverage.json) and publishes it to the
gh-pages branch via peaceiris/actions-gh-pages. The README badge
points at https://img.shields.io/endpoint?url=https://e2b-dev.github.io/ublk-go/badge/coverage.json.
No external coverage service or token is required — GITHUB_TOKEN
handles the push to gh-pages.
Bare unit tests alone give ~33% coverage because most of the library needs root + ublk_drv loaded to exercise. The integration test binary pushes the total near ~80% once both profiles are combined.
Three knobs to set for any deployment creating more than a handful of devices at once (target workload is "a few hundred"):
- `ublks_max` (kernel module parameter) — default 64, raise to 4096 via `etc/ublk.conf` → `/etc/modprobe.d/ublk.conf`. The counter is bumped by every `UBLK_CMD_ADD_DEV` regardless of caller privileges (the module description's "unprivileged" wording is misleading — the check is global). Hitting it surfaces as `EACCES`.
- udev CHANGE-event inotify watching — default on, turn off via `etc/97-ublk-device.rules` → `/etc/udev/rules.d/`. Same policy NBD uses. Safe to disable; the watching is wasteful under heavy I/O.
- `RLIMIT_NOFILE` on the process using the library — default 1024–4096 on most distros. Each `ublk.Device` holds 3 fds internally (control, char, io_uring) plus any user fd on `/dev/ublkbN`. 500 devices ≈ 1500+ fds. Raise via `ulimit -n 65536` in the shell or `LimitNOFILE=65536` in the systemd unit.
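For the first knob, the drop-in has the usual modprobe.d shape. A hedged sketch (standard `options` syntax; the value matches the 4096 target stated above — check the repo's own `etc/ublk.conf` for the canonical copy):

```
# /etc/modprobe.d/ublk.conf
options ublk_drv ublks_max=4096
```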
Reload ublk_drv and udev without rebooting:

```shell
sudo rmmod ublk_drv && sudo modprobe ublk_drv
sudo udevadm control --reload-rules && sudo udevadm trigger
```

Verify:

```shell
cat /sys/module/ublk_drv/parameters/ublks_max  # 4096
ulimit -n                                      # 65536+
```

- The `ubuntu-24.04` runner has Go 1.25.8 preinstalled. The workflow passes `go-version: "1.25"` + `check-latest: false` so `setup-go` matches the preinstalled version instead of fetching 1.25.6 from scratch (~10s saved).
- `go.mod`'s `go 1.25.0` directive is the canonical form. `go mod tidy` rewrites `go 1.25` → `go 1.25.0`; don't commit the short form or the `lint-tidy` step will fail.
- `golangci-lint` is installed from the prebuilt tarball via the project's own `install.sh`, pinned by tag (`v2.11.4` currently). Going via `go install` compiles from source and takes ~107s instead of ~3s.
- `actions/*@<oldpin>` older than Feb 2025 hit the retired Actions Cache v1 service and log `Cache service responded with 400`. If you see that warning reappear, bump the pin.
- `queueDepth = 128` (in ublk.go), `maxSectors = 256` (128 KiB max I/O). These are hard-coded; changing them means re-running the full integration test because buffer sizing and the kernel param struct depend on the values.
- `maxQueueDepth = 4096` (in types.go) is the kernel's `UBLK_MAX_QUEUE_DEPTH` constant, used only for the mmap offset calculation when mapping IO descriptors from the char device. It does not affect the actual queue depth.
- The backend is called with a `[]byte` slice whose length already reflects the logical IO size (nr_sectors * 512). Don't re-clip.
When poking the mount from another terminal, writes to the page cache are not visible to Backend.WriteAt until either:
- `sync -f <mountpoint>` (or an `fsync(2)` on any fd there), or
- the kernel's periodic flush (`/proc/sys/vm/dirty_expire_centisecs`, default 3000 = 30s).

Plain `sync(1)` syncs every mount on the host, so it can look "stuck" for a long time on a busy system even when nothing in ublk is wrong. Always prefer `sync -f`.
drop_caches=3 does not flush dirty pages (contrary to folklore —
the kernel just drops what's already clean; see fs/drop_caches.c).
If a drop_caches call appears to take several seconds, what's
actually happening is the kernel's bdi writeback thread firing on
its own timer (/proc/sys/vm/dirty_writeback_centisecs, default 500
cs = 5 s) during the same wall-clock window, and the backend sees those
writes attributed to the drop_caches step by a naive benchmark.
The practical fix is to sync -f <mountpoint> before any call that
requires a clean filesystem — then drop_caches runs in ~150 ms and
no background writeback interferes.
make flushbench empirically confirms: max gap between consecutive
backend calls while our stack is active is ≤4.3 ms. Seconds-level
stalls always attribute to kernel writeback timing, not our code.
CodeQL extractor only scans files with default build tags. The 3 it
misses are the //go:build integration test and the two example/
main.go packages (they live in different package main roots).
That's informational, not a failure.
GitHub's Default Code Scanning setup and our codeql.yml advanced
workflow are mutually exclusive. If both are enabled, advanced runs fail
with Resource not accessible by integration when uploading SARIF (the
default setup owns that endpoint). Toggle one off in repo settings.
When cross-checking correctness or looking for features/optimizations, these are the canonical implementations to compare against. They were last audited in April 2026.
- axboe/liburing — The canonical C library for io_uring, maintained by Jens Axboe (io_uring author). Gold standard for memory barrier placement. Key files: `src/queue.c` (SQ flush, CQ read, `sq_ring_needs_enter`), `src/include/liburing/barrier.h` (barrier primitives). Key learnings from auditing liburing:
  - SQ tail store uses release semantics when SQPOLL is set, plain store otherwise. Our `atomic.StoreUint32` is release (more conservative but correct).
  - Full `smp_mb()` barrier required between SQ tail write and reading the `IORING_SQ_NEED_WAKEUP` flag. Only relevant for SQPOLL mode (we don't use it). See liburing issue #541 / commit 744f4156b25d.
  - CQ head/tail reads use relaxed loads (`READ_ONCE`). Our atomic loads are acquire (more conservative but correct).
  - sqArray is pre-populated with identity mapping at setup time and never written during flush. We write it every flush (harmless but wasteful; see TODO.md "Pre-populate SQ array at setup").
- tokio-rs/io-uring — Rust io_uring bindings. Key files: `src/squeue.rs`, `src/cqueue.rs`, `src/submit.rs`.
  - Key issue: #197 — SeqCst fence race with SQPOLL. Without a `fence(SeqCst)` between writing the SQ tail and reading the NEED_WAKEUP flag, the CPU can reorder operations, causing a deadlock where work is submitted but the kernel poll thread never wakes. Fixed by adding `atomic::fence(SeqCst)` before `sq_need_wakeup()`. Not relevant to us (no SQPOLL) but documents why the barrier is needed.
  - Key issue: #302 — CQ overflow with `IORING_SETUP_CQ_NODROP`. When the CQ is full and NODROP is set, the kernel backs up entries. Flushing them requires `io_uring_enter` with GETEVENTS. We don't use NODROP, so overflow would silently drop CQEs. See TODO.md "CQ overflow detection".
- ublk-org/ublksrv — The C reference implementation maintained by Ming Lei (ublk kernel driver author). Key files: `lib/ublksrv.c` (IO loop, queue management), `lib/ublksrv_cmd.c` (control path). Key patterns:
  - Single-threaded per-queue design: one pthread per queue, no locks on the hot path. Our Go equivalent is `runtime.LockOSThread()` per worker goroutine.
  - IO loop uses `io_uring_submit_and_wait_timeout()` — a single syscall for submit + wait. We use two syscalls (Submit + epoll_wait). See TODO.md "Single-syscall IO loop".
  - Idle detection: 20-second timeout → `madvise(MADV_DONTNEED)` on buffers. See TODO.md "Idle buffer page discard".
  - Batch IO mode (`UBLK_F_BATCH_IO`): multishot fetch + double-buffered commits. See TODO.md "Batch IO mode".
  - Issue #173: EINVAL on kernel 6.15+ from DevID mismatch between `devInfo` and `ctrlCmd`. We already handle this correctly (both set to `^uint32(0)`).
  - Issue #63: "can't get sqe" on misconfigured liburing (needs `IORING_SETUP_SQE128`). We handle this (control ring uses `NewSQE128`).
- ublk-org/rublk — Rust ublk implementation (depends on the libublk-rs crate). Key files: `src/loop.rs`, `src/null.rs`, `src/qcow2.rs`. Key learnings:
  - PR #15 (open): Replace `unwrap()` panics in the IO path with error codes. Panicking in the IO handler crashes the daemon and leaves the kernel device hanging forever. Our code correctly returns `-EIO` from `handleIO` — we never panic.
  - PR #14: Timeout CQEs mistakenly treated as IO task wakeups. Demonstrates the importance of distinguishing internal events from IO completions. Not relevant to us (we don't use timeouts in the worker loop).
  - PR #13: Batch eventfd notifications with a counter to reduce syscalls. Our eventfd usage is limited to cancellation (one-shot), so this optimization doesn't apply.
  - PR #7: FLUSH operations in zero-copy mode need special handling. Will be relevant when we implement Flusher + zero-copy.
Our implementation was cross-referenced against all four repos. No correctness bugs were found. The implementation correctly handles:
- Memory barriers for SQ/CQ ring management (Go's `atomic` package provides release/acquire semantics on all platforms including ARM64)
- DevID matching for kernel 6.17+ (`^uint32(0)` in both devInfo and ctrlCmd)
- FETCH_REQ task work processing (initial SubmitAndWait with GETEVENTS)
- COMMIT_AND_FETCH without GETEVENTS in the worker loop (works because we don't set `IORING_SETUP_DEFER_TASKRUN` — task work runs during syscall return)
- Error codes returned to kernel (no panics in IO path)
- Shutdown sequencing (cancel → wait → cleanup → close charFD → stop → del)

Areas where we are more conservative than necessary (harmless):
- `atomic.StoreUint32` for sqTail (release) — liburing uses a plain store without SQPOLL
- `atomic.LoadUint32` for cqHead/cqTail (acquire) — liburing uses relaxed loads
- sqArray written every flush — liburing writes once at setup