AGENTS.md

Running log of non-obvious things about this codebase for future humans and AI agents working in this repo. Append entries; don't rewrite history. Keep it factual. Read this before diving in.

How to verify the whole stack autonomously

# The core integration suite — probe + torture are now Go tests in
# ublk/integration_*_test.go, invoked by make test-integration.
make test-integration

# Standalone harnesses (still example/* binaries for now):
make chain           # sudo needed; stacks two ublks (proxy -> storage)
make stress          # sudo needed; race-detector stress (create/close churn, IO-while-close, etc.)
make fault           # sudo needed; Backend returns EIO; verifies errors propagate to userspace
make sigkill         # sudo needed; child process killed mid-I/O; verifies kernel cleanup
make flushbench      # sudo needed; microsecond trace of backend calls during flush operations

# Torture long soak:  UBLK_TORTURE_DURATION=30m go test -tags=integration -run TestTortureRandomIO ...

The probe (example/probe/main.go) exercises both sides of the stack:

  • Device-level (direct I/O, bypasses page cache): BLKGETSIZE64 size check; pre-mkfs zero-read; random-block write/read roundtrip that also verifies the backend's raw storage holds the same bytes at the same offset (proves kernel ↔ userspace offset mapping is 1:1).
  • Filesystem-level: mkfs.ext4 → mount → scripted write + sync -f (asserts backend writes > 0) → fsync alone (also asserts backend writes > 0) → drop caches + readback (asserts backend reads > 0) → scan the backend for the magic pattern (proves filesystem reads ultimately come from our in-memory storage) → concurrent writers → remount (journal replay) → umount → close → verify /dev/ublkbN gone.

If a step hangs beyond the timeout the probe panics, which prints a full goroutine dump from the Go runtime — this is the single most useful artifact when diagnosing ublk-level stalls, because it tells you whether the worker is blocked in WaitCQE, inside Backend.*, or elsewhere.

make chain (example/chain/main.go) creates two ublk devices in the same process: a storage ublk with an in-memory backend, then a proxy ublk whose Backend forwards Pread/Pwrite calls to the storage's block device (opened O_DIRECT). I/O written to the proxy's block device must appear byte-for-byte at the same offset in the storage's in-memory backend. This validates two complete ublk stacks running side-by-side, two LockOSThread'd workers, and cross-device data integrity. If this test passes, composition works.

The torture test (originally example/torture/main.go, now TestTortureRandomIO in the integration suite) is the fuzz-style integrity test. Each of N worker goroutines owns a disjoint region of the device; each picks a random (offset, length) inside its region and a random direction (read or write); on write it updates an in-memory shadow of what the device should contain; on read it compares the result against the shadow and fails the run (non-zero exit, with first-differing byte offset) on any mismatch. Periodic fsync and full-region reverify runs exercise the write-through and journaling paths. Run for minutes, not seconds, to find subtle ordering bugs.

make fault (example/fault/main.go) injects backend errors at a configurable rate and checks they propagate all the way up to Pwrite/Pread on /dev/ublkbN. The scenarios cover low-rate failures (10%), total write/read failure (100%), and the often-forgotten "Close() with pending errors" case — which must not hang.

make sigkill (example/sigkill/main.go) spawns a child process, kills it with SIGKILL mid-I/O, and verifies the kernel's own cleanup path (ublk_ch_release on fd close) is sufficient to remove the device nodes and free whatever state the kernel holds. The parent then creates a fresh device to confirm no leak. Matters because SIGKILL bypasses every Go-level cleanup (defer, sync.Once, etc.) — the kernel is the only thing protecting us.

make stress (example/stress/main.go) runs four stressors against -race-instrumented library code:

  • churn — tight New→small-I/O→Close loop, catches leaks and shutdown-order races.
  • ioWhileClose — I/O goroutines hammer the block device; Close() mid-stream. Catches races between worker cleanup and in-flight I/O.
  • concurrentClose — N goroutines call Close() at once. Confirms the sync.Once guard is sufficient.
  • many — N devices alive simultaneously with writer goroutines, closed in parallel. Catches cross-device state bleed.

Any race-detector warning fails the run (non-zero exit). Run for longer (-duration 5m) before a release or after touching shutdown code.

Other diagnostic commands:

pgrep -af 'example/probe' | awk '{print $1}' | xargs -r sudo kill -SIGQUIT   # manual stack dump
cat /sys/class/block/ublkb*/stat                                              # block stats
sudo dmesg | tail -40                                                         # kernel messages (ublk_drv logs here)

Kernel ABI landmines (UAPI, current kernels 6.13+)

  • devInfo.DevID must match ctrlCmd.DevID (kernel 6.17+ validation). We set both to ^uint32(0) to request auto-assign. Previous code only set it in the ctrlCmd, which started returning EINVAL on 6.17.
  • ADD_DEV has two encodings. The ioctl-encoded command (uCmdAddDev) is newer; the legacy cmdAddDev is tried as fallback. Expect ENOTSUP from the legacy path on modern kernels — that's normal, just means the first path succeeded.
  • FETCH_REQ is processed as deferred task work starting around 6.13. It only completes when the io_uring is entered with IORING_ENTER_GETEVENTS. That is why worker.run() submits via SubmitAndWait() (which passes that flag), not Submit(). Using plain Submit() leaves START_DEV hanging in the kernel waiting for the fetch to complete.
  • Control ring uses SQE128 (for URING_CMD passthrough of ublksrv_ctrl_cmd, which sits in sqe->cmd at offset 48). Data ring uses SQE64 and packs the 16-byte ublksrv_io_cmd into the trailing Cmd field.

Worker-goroutine discipline

  • Each worker must call runtime.LockOSThread() before its first io_uring_enter. ublk binds IO credentials to the thread that first submitted FETCH_REQ. If a goroutine gets migrated between threads, subsequent submissions fail or go to the wrong queue.
  • FETCH_REQ must be submitted before START_DEV is issued (kernel blocks START_DEV until the fetches arrive). The worker signals readiness through a channel after its first SubmitAndWait() so the main goroutine can proceed to START_DEV. See the comments in worker.run().
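The startup handshake can be sketched like this (workerRun, ready, and submitFetch are illustrative stand-ins for the library's internals):

```go
package main

import (
	"fmt"
	"runtime"
)

// workerRun pins itself to an OS thread, submits its FETCH_REQs, and
// signals readiness so the main goroutine may issue START_DEV.
func workerRun(ready chan<- struct{}, submitFetch func()) {
	runtime.LockOSThread() // ublk binds IO state to this thread
	defer runtime.UnlockOSThread()

	submitFetch() // real code: SubmitAndWait(), which carries GETEVENTS
	ready <- struct{}{}

	// ... main CQE loop would follow here ...
}

func main() {
	ready := make(chan struct{})
	go workerRun(ready, func() { fmt.Println("FETCH_REQ submitted") })
	<-ready // happens-after the fetch submission
	fmt.Println("START_DEV may now be issued")
}
```

The channel send after submitFetch is what guarantees START_DEV is never issued before the fetches are in flight.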

Close the block-device fds before calling Device.Close

The library's Device.Close issues UBLK_CMD_DEL_DEV, which is fundamentally del_gendisk() inside the kernel. del_gendisk() blocks until all open fds to /dev/ublkbN are released — this is standard block-device teardown semantics, not a ublk quirk.

So: if a user opens /dev/ublkbN for their own I/O, they must unix.Close(fd) before calling dev.Close(). Otherwise dev.Close hangs indefinitely waiting for del_gendisk.

This almost took us out twice — once as "sync got stuck during fsdemo", once as "ioWhileClose hangs". The fix in our test harnesses is to close user fds first, then call Device.Close. For users of the library, this needs to be documented clearly (README API section is a good place; not yet done).

If someone wants a "force close even with open fds" behaviour in the library, the options are: (a) have the library track user fds — a major API change and leaky abstraction; (b) switch to UBLK_CMD_DEL_DEV_ASYNC (kernel 6.1+), which marks the device for deletion but returns immediately — /dev/ublkbN disappears later when the refcount drops. The async variant is a better default but changes Close's semantics (it no longer guarantees the node is gone on return). Not implemented; worth considering if users hit this.

Clean Close always works. What leaves orphan device nodes is ungraceful termination where Device.Close() is never called:

  • Ctrl+C on a harness that doesn't trap SIGINT: Go's default handler exits without running defers. The long-running harnesses (stress, torture, flushbench) now trap SIGINT/SIGTERM and return cleanly, so defer dev.Close() runs. Short-running ones (probe, chain, fault) don't need it — they're too fast to matter in practice.
  • SIGKILL or crash: deliberately tested by make sigkill. The kernel must clean up on its own via ublk_ch_release on fd close. On kernel 6.17.0 that path can wedge processes in D state, leaving orphan nodes until reboot. Unknown whether a specific upstream fix exists; ublk was heavily refactored Sept 2025 and 6.18 stable has multiple fixes. TODO.md tracks.

The library-correctness question ("can we keep using the API after an ungraceful kill?") is distinct from node cleanup, and it is answerable: make sigkill verifies that calling ublk.New in the parent after the child is SIGKILL'd succeeds (observed: 11 ms on kernel 6.17.0). The kernel allocates new minor numbers independently of whether old ones have been reclaimed, until the ublks_max limit (default 64) is hit.

If you accumulate stale nodes, reboot — no userspace cleanup works reliably.

Diagnostic commands when investigating this

# how many ublk minors are consumed vs the kernel's limit
ls /sys/class/ublk-char/ | wc -l
cat /sys/module/ublk_drv/parameters/ublks_max

# module refcount (grows over a session, never shrinks if devices leak)
lsmod | awk '$1 == "ublk_drv"'

# find the stuck processes — they'll be in D (uninterruptible) state
ps -eo pid,state,cmd | awk '$2 == "D"' | grep -i ublk

# what kernel routine is a stuck process waiting on
sudo cat /proc/<PID>/stack
cat /proc/<PID>/wchan

# recent ublk messages
journalctl -k --since '5 minutes ago' | grep -i ublk

If /proc/<PID>/stack shows frames inside ublk_ch_release or blk_mq_quiesce_queue or similar, that's the smoking gun for the kernel-side hang. Take the output, report to linux-block mailing list (and cc: Ming Lei — ming.lei@redhat.com, the driver maintainer), and reboot the machine.

See TODO.md for a planned upstream repro.

Ring.Cancel must be observable from the busy path

Ring.Cancel() uses an eventfd/epoll wakeup to break a blocked WaitCQE. But WaitCQE has a fast-path that returns an already-queued CQE without ever calling epoll_wait — and under sustained I/O pressure the CQ is always non-empty when the worker re-enters WaitCQE. Without an additional cancel-flag check the worker never observes the cancel signal and Device.shutdown hangs forever.

Fix (current): Ring.cancelled (atomic.Bool) set by Cancel(), checked at the top of every WaitCQE iteration. The eventfd+epoll setup stays — it handles the case where the CQ is empty and WaitCQE is blocked in epoll_wait. The regression test is TestCancelObservedWithCQEReady in ublk/uring/uring_test.go; do not remove it if you refactor WaitCQE.
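The shape of the fix can be sketched as follows (ring, cq, and block are illustrative stand-ins; the real code lives in ublk/uring):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrCancelled = errors.New("ring cancelled")

// ring sketches the cancel-observability fix: the cancelled flag is
// checked at the top of every WaitCQE iteration, so even when the CQ is
// never empty (fast path never reaches epoll_wait) the cancel is seen.
type ring struct {
	cancelled atomic.Bool
	cq        []uint64 // pretend queued CQEs
}

func (r *ring) Cancel() {
	r.cancelled.Store(true)
	// real code also writes the eventfd to wake a blocked epoll_wait
}

// block stands in for epoll_wait on the ring fd plus the cancel eventfd;
// a Cancel() returns from here with cancelled already set.
func (r *ring) block() {}

func (r *ring) WaitCQE() (uint64, error) {
	for {
		if r.cancelled.Load() { // must come BEFORE the fast path
			return 0, ErrCancelled
		}
		if len(r.cq) > 0 { // fast path: CQE already queued, no syscall
			cqe := r.cq[0]
			r.cq = r.cq[1:]
			return cqe, nil
		}
		r.block() // slow path: CQ empty, wait for CQE or eventfd wakeup
	}
}

func main() {
	r := &ring{cq: []uint64{1, 2, 3}}
	r.Cancel()
	_, err := r.WaitCQE()
	fmt.Println(err) // cancel observed even with CQEs ready
}
```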

Shutdown sequencing (current, post data-race fixes)

Device.shutdown() ordering matters:

  1. w.ioRing.Cancel() on each worker (eventfd wake of blocked WaitCQE). This is a main-goroutine operation, safe because the worker hasn't closed the ring yet.
  2. wg.Wait() — happens-before barrier; workers have exited run() and will not touch any shared state thereafter.
  3. For each worker: w.cleanup() to munmap ioDescs and Close() the ring. Done from main goroutine, so ring state writes don't race with reads in Cancel().
  4. close(charFD) — triggers ublk_ch_release in the kernel, aborting any stale ublk_io state so delDev() won't block on in-flight IOs.
  5. stopDev(), delDev().
  6. Close control ring, close ctrlFD.

The old version interleaved these steps and race-detected between worker.cleanup / Device.shutdown / Ring.Cancel / Ring.Close. Do not refactor the order without rerunning make test-integration under -race — the kernel doesn't enforce this and bugs are stochastic.

Build tags and tooling

  • Integration tests live in ublk/ublk_integration_test.go behind //go:build integration. The file's TestMain hard-fails (not skips) if not run as root or if ublk_drv is missing. Don't reintroduce t.Skip for these — the user explicitly wants failure, not silence.
  • golangci-lint must know about the tag or it flags memBackend.snapshot as unused. Set via run.build-tags: [integration] in .golangci.yml.
  • gopls / editors don't read .golangci.yml. The portable fix is go env -w GOFLAGS=-tags=integration. For VSCode specifically, .vscode/settings.json has gopls.build.buildFlags — but .vscode/ is .gitignored, so don't rely on committing it.

Coverage

make cover produces coverage/unit.out + coverage/integration.out. make cover-html opens the integration profile in a browser.

CI splits coverage collection across three jobs:

  • test-unit (amd64+arm64): runs unit tests, uploads unit.out and unit.html as the coverage-unit artifact.
  • test-integration (amd64+arm64): runs integration tests, uploads integration.out and integration.html as coverage-integration.
  • coverage (amd64 only, needs: [test-unit, test-integration]): downloads both artifacts, merges them with gocovmerge, uploads combined.out/combined.html as coverage-combined.

Every run page therefore has three separate artifact bundles plus a ## Combined coverage block in the step summary showing the merged number.

On pushes to main, the coverage job generates a Shields.io endpoint JSON file (badge/coverage.json) and publishes it to the gh-pages branch via peaceiris/actions-gh-pages. The README badge points at https://img.shields.io/endpoint?url=https://e2b-dev.github.io/ublk-go/badge/coverage.json. No external coverage service or token is required — GITHUB_TOKEN handles the push to gh-pages.

Bare unit tests alone give ~33% coverage because most of the library needs root + ublk_drv loaded to exercise. The integration test binary pushes the total near ~80% once both profiles are combined.

Production / self-hosted-runner setup

Three knobs to set for any deployment creating more than a handful of devices at once (target workload is "a few hundred"):

  1. ublks_max (kernel module parameter) — default 64; raise it to 4096 via the repo's etc/ublk.conf, installed as /etc/modprobe.d/ublk.conf. The counter is bumped by every UBLK_CMD_ADD_DEV regardless of caller privileges (the module description's "unprivileged" wording is misleading — the check is global). Hitting it surfaces as EACCES.

  2. udev CHANGE-event inotify watching — default on; turn it off via the repo's etc/97-ublk-device.rules, installed under /etc/udev/rules.d/. Same policy NBD uses. The watching is safe to skip and wasteful under heavy I/O.

  3. RLIMIT_NOFILE on the process using the library — default 1024–4096 on most distros. Each ublk.Device holds 3 fds internally (control, char, io_uring) plus any user fd on /dev/ublkbN. 500 devices ≈ 1500+ fds. Raise via ulimit -n 65536 in the shell or LimitNOFILE=65536 in the systemd unit.
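Raising the soft limit from inside the process can be sketched with the stdlib (raiseNOFILE is a hypothetical helper; an unprivileged process can only raise the soft limit up to the hard limit):

```go
package main

import (
	"fmt"
	"syscall"
)

// raiseNOFILE bumps the soft RLIMIT_NOFILE toward want, capped at the
// hard limit. With 3 internal fds per ublk.Device plus user fds,
// 500 devices needs well over the common 1024 default.
func raiseNOFILE(want uint64) (uint64, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	if rl.Cur >= want {
		return rl.Cur, nil // already sufficient
	}
	target := want
	if target > rl.Max {
		target = rl.Max // can't exceed the hard limit without privilege
	}
	rl.Cur = target
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	return target, nil
}

func main() {
	n, err := raiseNOFILE(65536)
	fmt.Println("soft NOFILE now:", n, "err:", err)
}
```

Doing this in-process is an alternative to ulimit -n / LimitNOFILE= when you control the binary but not its service unit.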

Reload ublk_drv and udev without rebooting:

sudo rmmod ublk_drv && sudo modprobe ublk_drv
sudo udevadm control --reload-rules && sudo udevadm trigger

Verify:

cat /sys/module/ublk_drv/parameters/ublks_max   # 4096
ulimit -n                                       # 65536+

CI specifics

  • ubuntu-24.04 runner has Go 1.25.8 preinstalled. The workflow passes go-version: "1.25" + check-latest: false so setup-go matches the preinstalled version instead of fetching 1.25.6 from scratch (~10s saved).
  • go.mod's go 1.25.0 directive is the canonical form. go mod tidy rewrites go 1.25 to go 1.25.0; don't commit the short form or the lint-tidy step will fail.
  • golangci-lint is installed from the prebuilt tarball via the project's own install.sh, pinned by tag (v2.11.4 currently). Going via go install compiles from source and takes ~107s instead of ~3s.
  • actions/*@<oldpin> older than Feb 2025 hit the retired Actions Cache v1 service and log Cache service responded with 400. If you see that warning reappear, bump the pin.

Data-plane details

  • queueDepth = 128 (in ublk.go), maxSectors = 256 (128 KiB max I/O). These are hard-coded; changing them means re-running the full integration test because buffer sizing and kernel param struct depend on the values.
  • maxQueueDepth = 4096 (in types.go) is the kernel's UBLK_MAX_QUEUE_DEPTH constant, used only for the mmap offset calculation when mapping IO descriptors from the char device. It does not affect the actual queue depth.
  • The backend is called with a []byte slice whose length already reflects the logical IO size (nr_sectors * 512). Don't re-clip.
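A minimal in-memory backend illustrating that contract (the ReadAt/WriteAt shape is assumed for the sketch; the library's actual Backend interface may differ):

```go
package main

import "fmt"

const sectorSize = 512

// memBackend is a toy in-memory backend. The library hands it a buffer
// whose length already equals nr_sectors*512; the backend must fill or
// consume exactly len(p) bytes at off — no re-clipping.
type memBackend struct{ data []byte }

func (b *memBackend) ReadAt(p []byte, off int64) error {
	copy(p, b.data[off:off+int64(len(p))])
	return nil
}

func (b *memBackend) WriteAt(p []byte, off int64) error {
	copy(b.data[off:], p)
	return nil
}

func main() {
	b := &memBackend{data: make([]byte, 8*sectorSize)}

	// A 2-sector write at sector 1: the library would pass len(p)=1024.
	buf := make([]byte, 2*sectorSize)
	for i := range buf {
		buf[i] = 0xAB
	}
	b.WriteAt(buf, 1*sectorSize)

	out := make([]byte, 2*sectorSize)
	b.ReadAt(out, 1*sectorSize)
	fmt.Println("roundtrip ok:", out[0] == 0xAB && out[len(out)-1] == 0xAB)
}
```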

Known observations

ext4 + page cache timing

When poking the mount from another terminal, writes to the page cache are not visible to Backend.WriteAt until either:

  • sync -f <mountpoint> (or an fsync(2) on any fd there), or
  • the kernel's periodic flush (/proc/sys/vm/dirty_expire_centisecs, default 3000 = 30s).

Plain sync(1) syncs every mount on the host, so it can look "stuck" for a long time on a busy system even when nothing in ublk is wrong. Always prefer sync -f.

drop_caches latency is kernel-side, not ours

drop_caches=3 does not flush dirty pages (contrary to folklore — the kernel just drops what's already clean; see fs/drop_caches.c). If a drop_caches call appears to take several seconds, what's actually happening is the kernel's bdi writeback thread firing on its own timer (/proc/sys/vm/dirty_writeback_centisecs, default 500 cs = 5 s) during the same wall-clock window, and the backend sees those writes attributed to the drop_caches step by a naive benchmark.

The practical fix is to sync -f <mountpoint> before any call that requires a clean filesystem — then drop_caches runs in ~150 ms and no background writeback interferes.

make flushbench empirically confirms: max gap between consecutive backend calls while our stack is active is ≤4.3 ms. Seconds-level stalls always attribute to kernel writeback timing, not our code.

"scanned 6 out of 9 Go files" in CodeQL

CodeQL extractor only scans files with default build tags. The 3 it misses are the //go:build integration test and the two example/ main.go packages (they live in different package main roots). That's informational, not a failure.

Default Code Scanning vs. advanced workflow

GitHub's Default Code Scanning setup and our codeql.yml advanced workflow are mutually exclusive. If both are enabled, advanced runs fail with Resource not accessible by integration when uploading SARIF (the default setup owns that endpoint). Toggle one off in repo settings.

Reference implementations

When cross-checking correctness or looking for features/optimizations, these are the canonical implementations to compare against. They were last audited in April 2026.

io_uring userspace libraries

  • axboe/liburing — The canonical C library for io_uring, maintained by Jens Axboe (io_uring author). Gold standard for memory barrier placement. Key files: src/queue.c (SQ flush, CQ read, sq_ring_needs_enter), src/include/liburing/barrier.h (barrier primitives).

    Key learnings from auditing liburing:

    • SQ tail store uses release semantics when SQPOLL is set, plain store otherwise. Our atomic.StoreUint32 is release (more conservative but correct).
    • Full smp_mb() barrier required between SQ tail write and reading IORING_SQ_NEED_WAKEUP flag. Only relevant for SQPOLL mode (we don't use it). See liburing issue #541 / commit 744f4156b25d.
    • CQ head/tail reads use relaxed loads (READ_ONCE). Our atomic loads are acquire (more conservative but correct).
    • sqArray is pre-populated with identity mapping at setup time and never written during flush. We write it every flush (harmless but wasteful; see TODO.md "Pre-populate SQ array at setup").
  • tokio-rs/io-uring — Rust io_uring bindings. Key files: src/squeue.rs, src/cqueue.rs, src/submit.rs.

    Key issue: #197 — SeqCst fence race with SQPOLL. Without a fence(SeqCst) between writing SQ tail and reading the NEED_WAKEUP flag, the CPU can reorder operations causing a deadlock where work is submitted but the kernel poll thread never wakes. Fixed by adding atomic::fence(SeqCst) before sq_need_wakeup(). Not relevant to us (no SQPOLL) but documents why the barrier is needed.

    Key issue: #302 — CQ overflow with IORING_SETUP_CQ_NODROP. When CQ is full and NODROP is set, the kernel backs up entries. Flushing them requires io_uring_enter with GETEVENTS. We don't use NODROP, so overflow would silently drop CQEs. See TODO.md "CQ overflow detection".
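The index discipline those audits revolve around can be illustrated with a single-producer, single-consumer ring sketch (this is not the library's or io_uring's actual layout — real rings are shared with the kernel via mmap — but the publish/consume ordering is the same):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sq is a toy submission ring: the producer fills the entry first, then
// publishes it by storing the new tail. Go's atomic store has release
// semantics, so a consumer's atomic (acquire) load of tail is guaranteed
// to observe the entry contents written before the store.
type sq struct {
	tail    uint32
	entries [8]uint64
}

func (q *sq) push(sqe uint64) {
	t := atomic.LoadUint32(&q.tail)
	q.entries[t%8] = sqe             // 1. write the entry
	atomic.StoreUint32(&q.tail, t+1) // 2. release-store tail (publish)
}

// pop consumes one entry using the caller-held head index; it returns
// false when head has caught up with the published tail.
func (q *sq) pop(head *uint32) (uint64, bool) {
	h := *head
	if h == atomic.LoadUint32(&q.tail) { // acquire-load of tail
		return 0, false
	}
	*head = h + 1
	return q.entries[h%8], true
}

func main() {
	var q sq
	var head uint32
	q.push(42)
	v, ok := q.pop(&head)
	fmt.Println(v, ok)
}
```

liburing gets away with plain stores and relaxed loads in the non-SQPOLL case; using Go's atomics everywhere (as the audit notes) is more conservative but never wrong.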

ublk userspace servers

  • ublk-org/ublksrv — The C reference implementation maintained by Ming Lei (ublk kernel driver author). Key files: lib/ublksrv.c (IO loop, queue management), lib/ublksrv_cmd.c (control path).

    Key patterns:

    • Single-threaded per-queue design: one pthread per queue, no locks on the hot path. Our Go equivalent is runtime.LockOSThread() per worker goroutine.
    • IO loop uses io_uring_submit_and_wait_timeout() — single syscall for submit + wait. We use two syscalls (Submit + epoll_wait). See TODO.md "Single-syscall IO loop".
    • Idle detection: 20-second timeout → madvise(MADV_DONTNEED) on buffers. See TODO.md "Idle buffer page discard".
    • Batch IO mode (UBLK_F_BATCH_IO): multishot fetch + double-buffered commits. See TODO.md "Batch IO mode".
    • Issue #173: EINVAL on kernel 6.15+ from DevID mismatch between devInfo and ctrlCmd. We already handle this correctly (both set to ^uint32(0)).
    • Issue #63: "can't get sqe" on misconfigured liburing (needs IORING_SETUP_SQE128). We handle this (control ring uses NewSQE128).
  • ublk-org/rublk — Rust ublk implementation (depends on libublk-rs crate). Key files: src/loop.rs, src/null.rs, src/qcow2.rs.

    Key learnings:

    • PR #15 (open): Replace unwrap() panics in IO path with error codes. Panicking in the IO handler crashes the daemon and leaves the kernel device hanging forever. Our code correctly returns -EIO from handleIO — we never panic.
    • PR #14: Timeout CQEs mistakenly treated as IO task wakeups. Demonstrates the importance of distinguishing internal events from IO completions. Not relevant to us (we don't use timeouts in the worker loop).
    • PR #13: Batch eventfd notifications with a counter to reduce syscalls. Our eventfd usage is limited to cancellation (one-shot), so this optimization doesn't apply.
    • PR #7: FLUSH operations in zero-copy mode need special handling. Will be relevant when we implement Flusher + zero-copy.

Cross-reference audit results (April 2026)

Our implementation was cross-referenced against all four repos. No correctness bugs were found. The implementation correctly handles:

  • Memory barriers for SQ/CQ ring management (Go's atomic package provides release/acquire semantics on all platforms including ARM64)
  • DevID matching for kernel 6.17+ (^uint32(0) in both devInfo and ctrlCmd)
  • FETCH_REQ task work processing (initial SubmitAndWait with GETEVENTS)
  • COMMIT_AND_FETCH without GETEVENTS in the worker loop (works because we don't set IORING_SETUP_DEFER_TASKRUN — task work runs during syscall return)
  • Error codes returned to kernel (no panics in IO path)
  • Shutdown sequencing (cancel → wait → cleanup → close charFD → stop → del)

Areas where we are more conservative than necessary (harmless):

  • atomic.StoreUint32 for sqTail (release) — liburing uses plain store without SQPOLL
  • atomic.LoadUint32 for cqHead/cqTail (acquire) — liburing uses relaxed loads
  • sqArray written every flush — liburing writes once at setup