
# block: Crate Refactoring - Cycle 3: Async Writes & Completion #8033

@weltling


Tracking Cycle 3 of the block crate refactoring.
See #7560 for the overall strategy discussion,
#7694 for Cycle 1, and
#7877 for Cycle 2.

RFC: https://github.com/weltling/cloud-hypervisor/blob/rfc-block-refactoring/docs/rfcs/block-refactoring-async-early.md

## Background

Cycle 1 established the QCOW2 foundation:

  • BlockError unified error type (block/src/error.rs)
  • Composable trait hierarchy (DiskFile, AsyncDiskFile in block/src/disk_file.rs)
  • QcowMetadata with coarse RwLock, ClusterReadMapping/ClusterWriteMapping
  • QcowDiskSync and QcowDiskAsync migrated to new traits
  • QcowAsync io_uring backend: reads of a single allocated cluster go
    through io_uring; multi-mapping reads, compressed reads, backing file
    reads, and all writes fall back to synchronous I/O with synthetic
    completions

Cycle 2 expanded the proven patterns to all formats:

  • All format errors migrated to BlockError at the public API boundary
  • All formats implement disk_file::DiskFile / AsyncDiskFile
  • RawDisk and VhdDisk unified from duplicate structs
  • IoUringContext shared infrastructure in io/io_uring.rs
  • Factory pattern (open_disk_file) replaces manual construction in vmm/
  • DiskBackend transitional enum and async_io::DiskFile removed
  • Module reorganization under formats/ and io/

Cycle 3 completes the async story for QCOW2 (writes, compressed reads,
multi-mapping reads, and async fsync) and extends async support to VHDx,
the last format without an async I/O path.

## Phase 4: Async Writes & Completion

### 4.1 QCOW2 Async Writes

QcowAsync currently falls back to synchronous cow_write_sync() with
synthetic completions for all writes. This section replaces the sync
fallback with a true async write path using io_uring.

Write complexity by case (simplest to hardest):

  1. Already allocated cluster: map_cluster_for_write returns
     Allocated { offset } with no metadata mutation needed. Direct
     io_uring Writev at the host offset. One async operation.
  2. Unallocated cluster, no backing file: metadata allocates a fresh
     cluster (L2 update + refcount) under a write lock, then returns
     Allocated { offset } for the new cluster. io_uring Writev to
     the new offset. One async operation; the metadata work completes
     before submission.
  3. Unallocated cluster with a backing file (COW): read the original
     data from the backing file, merge the guest write data into a
     cluster buffer, allocate a new cluster in metadata, then io_uring
     Writev the merged data. The backing read is synchronous today
     (backing files expose BackingRead::read_at); the write itself goes
     through io_uring.
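The merge step in case 3 can be sketched as a pure function. The name
`merge_cow_cluster` and its signature are illustrative only, not the block
crate's actual API; the point is that the backing cluster is read first,
the guest bytes are overlaid, and the merged buffer is what gets submitted
as a single full-cluster write:

```rust
// Hypothetical COW merge helper (names are illustrative, not the real API).
// backing_cluster: original cluster data read via BackingRead::read_at.
// offset_in_cluster / guest_data: where and what the guest is writing.
fn merge_cow_cluster(
    backing_cluster: &[u8],
    offset_in_cluster: usize,
    guest_data: &[u8],
) -> Vec<u8> {
    assert!(offset_in_cluster + guest_data.len() <= backing_cluster.len());
    let mut merged = backing_cluster.to_vec();
    merged[offset_in_cluster..offset_in_cluster + guest_data.len()]
        .copy_from_slice(guest_data);
    merged // full cluster, ready for one io_uring Writev at the new offset
}

fn main() {
    let backing = vec![0xAAu8; 16];
    let merged = merge_cow_cluster(&backing, 4, &[1, 2, 3]);
    assert_eq!(&merged[..4], &[0xAA; 4]); // untouched prefix from backing
    assert_eq!(&merged[4..7], &[1, 2, 3]); // guest bytes overlaid
    assert_eq!(&merged[7..], &[0xAA; 9]); // untouched suffix from backing
}
```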

In all three cases map_cluster_for_write already handles cluster
allocation and L2/refcount updates under the metadata lock (Cycle 1
design). The caller receives a host offset and performs data I/O
without holding the lock. This means the io_uring submission path is
structurally the same as reads: resolve offset, submit, drain CQ.

The state-machine complexity described in the RFC assumed metadata
mutation would need async I/O. Because QcowMetadata resolves all
metadata synchronously under a write lock and returns a plain host
offset, writes become one async operation per cluster region (after
the synchronous metadata step). The complexity therefore shifts to
multi-cluster writes, where a single guest request spans several
cluster boundaries and produces multiple io_uring submissions that
must all complete before the guest request is signalled done.
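The completion rule this implies can be sketched with a small tracking
table. The `PendingWrites` type below is a hypothetical stand-in for the
PendingWrite tracker proposed in 4.1.3; only the rule it encodes comes
from the text: one guest request fans out into N sub-operation user_data
values, and the guest completion fires only when every sub-op has drained
from the CQ, carrying the first error if any sub-write failed.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical tracker: maps a guest request's user_data to its in-flight
// sub-operation user_data values (the real tracker's shape is TBD in 4.1.3).
struct PendingWrite {
    remaining: HashSet<u64>, // sub-op user_data values still in flight
    error: Option<i32>,      // first errno seen on a failed sub-op, if any
}

#[derive(Default)]
struct PendingWrites {
    by_guest: HashMap<u64, PendingWrite>, // guest user_data -> tracker
    sub_to_guest: HashMap<u64, u64>,      // sub-op user_data -> guest user_data
}

impl PendingWrites {
    fn submit(&mut self, guest: u64, sub_ops: &[u64]) {
        for &s in sub_ops {
            self.sub_to_guest.insert(s, guest);
        }
        let remaining = sub_ops.iter().copied().collect();
        self.by_guest.insert(guest, PendingWrite { remaining, error: None });
    }

    // Called once per CQE. Returns Some((guest, result)) only when the whole
    // guest request is finished; None while sub-ops are still outstanding.
    fn complete(&mut self, sub: u64, result: i32) -> Option<(u64, Result<(), i32>)> {
        let guest = self.sub_to_guest.remove(&sub)?;
        let pw = self.by_guest.get_mut(&guest)?;
        pw.remaining.remove(&sub);
        if result < 0 && pw.error.is_none() {
            pw.error = Some(result);
        }
        if pw.remaining.is_empty() {
            let pw = self.by_guest.remove(&guest)?;
            return Some((guest, pw.error.map_or(Ok(()), Err)));
        }
        None
    }
}

fn main() {
    let mut p = PendingWrites::default();
    p.submit(100, &[1, 2, 3]); // one guest write, three cluster sub-writes
    assert!(p.complete(1, 4096).is_none()); // still 2 in flight
    assert!(p.complete(2, 4096).is_none()); // still 1 in flight
    assert_eq!(p.complete(3, 4096), Some((100, Ok(())))); // all done
}
```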

4.1.1 through 4.1.3 are sequential (each extends the previous).
4.1.4 depends on 4.1.3. 4.1.5 is independent of 4.1.4 but requires
4.1.3. 4.1.6 is the synchronization point.

| Task | Assignee | PR | Depends on | Notes |
| --- | --- | --- | --- | --- |
| 4.1.1 Async write to single allocated cluster | TBD | TBD | Cycle 2 | `map_cluster_for_write` returns `Allocated`. Submit io_uring `Writev` at the host offset with `user_data`. Drain the CQ in `next_completed_request`. Validates the write submission path in isolation. Keep `cow_write_sync` as the fallback for the remaining cases. |
| 4.1.2 Async write to unallocated cluster (no backing) | TBD | TBD | 4.1.1 | Same io_uring path as 4.1.1; `map_cluster_for_write` allocates the cluster under a write lock before returning `Allocated`. No new async machinery needed. Test: write to a fresh image with no backing file, read back, verify. |
| 4.1.3 Multi-cluster write with inflight tracking | TBD | TBD | 4.1.2 | A single guest write spanning N cluster boundaries produces N io_uring `Writev` submissions. Introduce a `PendingWrite` tracker that maps the original `user_data` to the set of sub-operation `user_data` values. The guest completion fires only when all N sub-operations complete in the CQ. Partial failure: if any sub-write fails, report an error for the guest request. |
| 4.1.4 COW write with backing file read | TBD | TBD | 4.1.3 | For partial-cluster writes to unallocated clusters with a backing file: read the original cluster data from the backing via `BackingRead::read_at` (synchronous), merge the guest bytes, submit io_uring `Writev` for the full cluster. The backing read is the only sync step; it completes before io_uring submission. The alternative (async backing read) is deferred because backing files may themselves be QCOW2. |
| 4.1.5 Async fsync | TBD | TBD | 4.1.3 | Replace the synchronous `metadata.flush()` in `QcowAsync::fsync` with a two-phase sequence: (1) `metadata.flush()` writes dirty L2 and refcount blocks synchronously (metadata I/O is small and infrequent), then (2) submit io_uring `IORING_OP_FSYNC` for the data file and wait for CQ completion. This separates the metadata flush (microseconds) from the data fdatasync (milliseconds) and makes the expensive part non-blocking. |
| 4.1.6 Async write tests | TBD | TBD | 4.1.1 to 4.1.5 | Write to allocated cluster. Write to unallocated (no backing). Spanning write across a cluster boundary. COW write with backing file. Partial-cluster write. Batch writes via `submit_batch_requests`. Async fsync. Concurrent writes from multiple queues. Error injection: SQ full, write I/O error. Read-back verification after every write test. |
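The ordering constraint in 4.1.5 (metadata flush must complete before the
data sync is issued) can be sketched with std file I/O. This is only an
illustration of the two phases: `sync_data()` below is a synchronous
stand-in for the io_uring `IORING_OP_FSYNC` submission, and `flush_metadata`
is a placeholder for `metadata.flush()`, not the real QcowMetadata API.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

// Phase 1 stand-in: small, synchronous metadata write-back
// (dirty L2 and refcount blocks in the real implementation).
fn flush_metadata(meta_file: &mut File) -> io::Result<()> {
    meta_file.flush()?;
    meta_file.sync_data()
}

// Two-phase fsync: metadata first (microseconds), then the data file
// (milliseconds) -- the second phase is the part 4.1.5 makes non-blocking
// by submitting IORING_OP_FSYNC instead of calling sync_data() inline.
fn two_phase_fsync(meta_file: &mut File, data_file: &File) -> io::Result<()> {
    flush_metadata(meta_file)?; // must complete before the data sync
    data_file.sync_data()
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let mut meta = OpenOptions::new().create(true).write(true).open(dir.join("meta.bin"))?;
    let mut data = OpenOptions::new().create(true).write(true).open(dir.join("data.bin"))?;
    data.write_all(b"cluster payload")?;
    two_phase_fsync(&mut meta, &data)?;
    println!("two-phase fsync completed");
    Ok(())
}
```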

Acceptance criteria for 4.1:

QcowAsync.write_vectored() submits data writes to io_uring instead of
falling back to cow_write_sync. Single-cluster writes complete in one
async operation. Multi-cluster writes track all sub-operations and fire
the guest completion only when all finish. fsync submits
IORING_OP_FSYNC to io_uring for the data file. All existing QCOW2
write tests pass unchanged. No data corruption under concurrent
multi-queue writes (verified by read-back tests).

### 4.2 QCOW2 Async Multi-Mapping Reads

QcowAsync::resolve_read currently falls back to scatter_read_sync
whenever map_clusters_for_read returns more than one mapping, or returns
a mapping that is not a single allocated cluster (compressed, backing,
zero). This covers three distinct fallback cases: 4.2.1 handles the
multi-mapping read, 4.2.2 handles compressed clusters, and 4.2.3 handles
backing file reads. 4.2.4 is the synchronization point for tests.
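The per-mapping dispatch that 4.2.1 introduces can be sketched as a match
over the mapping kind. The variant names follow the issue text
(`Allocated`, `Zero`, `Compressed`, `Backing`); the `ReadAction` enum and
`classify` helper are hypothetical, illustrating only which regions go
through io_uring and which stay inline or sync:

```rust
// Simplified stand-in for the Cycle 1 mapping type (fields trimmed).
enum ClusterReadMapping {
    Allocated { offset: u64 },
    Zero,
    Compressed,
    Backing,
}

// Hypothetical dispatch outcome for each mapping in a multi-mapping read.
enum ReadAction {
    SubmitReadv { host_offset: u64 }, // one io_uring Readv per Allocated mapping
    FillZero,                         // memset the guest iovec inline, no I/O
    SyncFallback,                     // Compressed/Backing stay sync initially
}

fn classify(mapping: &ClusterReadMapping) -> ReadAction {
    match mapping {
        ClusterReadMapping::Allocated { offset } => {
            ReadAction::SubmitReadv { host_offset: *offset }
        }
        ClusterReadMapping::Zero => ReadAction::FillZero,
        ClusterReadMapping::Compressed | ClusterReadMapping::Backing => {
            ReadAction::SyncFallback
        }
    }
}

fn main() {
    let mappings = vec![
        ClusterReadMapping::Allocated { offset: 0x10000 },
        ClusterReadMapping::Zero,
        ClusterReadMapping::Backing,
    ];
    // Only the Allocated mapping produces an io_uring submission;
    // the guest completion still waits for all regions to resolve.
    let async_ops = mappings
        .iter()
        .filter(|m| matches!(classify(m), ReadAction::SubmitReadv { .. }))
        .count();
    assert_eq!(async_ops, 1);
}
```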

| Task | Assignee | PR | Depends on | Notes |
| --- | --- | --- | --- | --- |
| 4.2.1 Async multi-mapping read with inflight tracking | TBD | TBD | Cycle 2 | When `map_clusters_for_read` returns N mappings, submit one io_uring `Readv` per `Allocated` mapping and handle `Zero` regions inline. Introduce a `PendingRead` tracker analogous to `PendingWrite` from 4.1.3 (or generalize both into a shared `PendingIo<T>` type). The guest completion fires when all sub-reads finish. Falls back to sync only for `Compressed` and `Backing` regions. |
| 4.2.2 Async read of compressed clusters | TBD | TBD | 4.2.1 | Currently, `ClusterReadMapping::Compressed { data }` returns already decompressed data from inside the metadata lock. This is correct but holds the write lock during decompression. Option A (start here): keep the current behavior; the async benefit is that other clusters in the same multi-mapping read go through io_uring while compressed regions are resolved inline. Option B (if profiling warrants): read raw compressed bytes via io_uring, decompress on CQ completion, scatter to iovecs. Option B changes the `ClusterReadMapping::Compressed` variant to carry raw bytes plus compression info and moves decompression out of the metadata lock. |
| 4.2.3 Async backing file read | TBD | TBD | 4.2.1 | For `ClusterReadMapping::Backing` regions in multi-mapping reads, read from the backing file. If the backing file's data fd is available and the backing format is RAW, submit io_uring `Readv` to the backing fd directly. Otherwise fall back to `BackingRead::read_at`. This makes the common case (RAW backing) fully async while complex cases (QCOW2 backing chain) remain sync. |
| 4.2.4 Multi-mapping read tests | TBD | TBD | 4.2.1 to 4.2.3 | Read spanning two allocated clusters. Read spanning allocated + zero. Read spanning allocated + backing. Read of a compressed cluster. Mixed multi-mapping read (allocated + compressed + backing in one request). Concurrent multi-mapping reads from multiple queues. Performance comparison: multi-mapping async vs sync fallback. |

Acceptance criteria for 4.2:

Reads spanning multiple clusters submit per-cluster io_uring operations
for Allocated regions instead of falling back to scatter_read_sync.
Zero regions are filled inline without I/O. Compressed and backing reads
use async I/O where feasible (RAW backing) or remain sync with clear
justification. The guest completion fires only after all sub-reads finish.

### 4.3 VHDx Async Support

Cycle 2 migrated VhdxDiskSync to the new trait hierarchy but deferred
the AsyncDiskFile implementation. VHDx follows the QCOW2 pattern: shared
metadata with interior mutability, and a per-queue async I/O worker
using IoUringContext.

VHDx is simpler than QCOW2: no compression, no backing files, no COW.
The BAT (Block Allocation Table) maps virtual block addresses to host
file offsets. Metadata is read-heavy and suits the same RwLock pattern
established by QcowMetadata.
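The shape of the `map_block_for_read` API proposed in 4.3.1 can be
sketched as follows. `VhdxMetadataSketch`, the simplified BAT layout, and
the `BlockMapping` variants here are placeholders (the real VHDx BAT has
per-entry state bits, not an `Option`); the sketch only shows how a guest
(offset, len) resolves block by block into host offsets for allocated
blocks and Zero for unallocated ones:

```rust
#[derive(Debug, PartialEq)]
enum BlockMapping {
    Allocated { host_offset: u64, len: u64 },
    Zero { len: u64 },
}

// Toy metadata: one BAT entry per block, Some(host file offset) if allocated.
struct VhdxMetadataSketch {
    block_size: u64,
    bat: Vec<Option<u64>>,
}

impl VhdxMetadataSketch {
    fn map_block_for_read(&self, mut offset: u64, mut len: u64) -> Vec<BlockMapping> {
        let mut out = Vec::new();
        while len > 0 {
            let block = (offset / self.block_size) as usize;
            let within = offset % self.block_size; // offset inside this block
            let chunk = (self.block_size - within).min(len); // clip at boundary
            out.push(match self.bat[block] {
                Some(base) => BlockMapping::Allocated { host_offset: base + within, len: chunk },
                None => BlockMapping::Zero { len: chunk },
            });
            offset += chunk;
            len -= chunk;
        }
        out
    }
}

fn main() {
    let meta = VhdxMetadataSketch { block_size: 1 << 20, bat: vec![Some(0x100000), None] };
    // A read spanning an allocated block and an unallocated block.
    let mappings = meta.map_block_for_read((1 << 20) - 512, 1024);
    assert_eq!(
        mappings,
        vec![
            BlockMapping::Allocated { host_offset: 0x100000 + (1 << 20) - 512, len: 512 },
            BlockMapping::Zero { len: 512 },
        ]
    );
}
```

The caller then drives the same machinery as QCOW2: io_uring Readv per
Allocated mapping, inline fill for Zero, with PendingIo-style tracking
for requests that span multiple blocks.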

4.3.1 and 4.3.2 are sequential. 4.3.3 depends on 4.3.2. 4.3.4 is the
synchronization point.

| Task | Assignee | PR | Depends on | Notes |
| --- | --- | --- | --- | --- |
| 4.3.1 VhdxMetadata with interior mutability | TBD | TBD | Cycle 2 | Extract mutable state from `Vhdx` into `VhdxMetadata` with `Arc<RwLock<>>` for the BAT and region table. Add `map_block_for_read(offset, len) -> Vec<BlockMapping>` and `map_block_for_write(offset, len) -> Vec<BlockMapping>` returning host offsets (simpler than QCOW2: no compression, no backing chain). `VhdxDiskSync` uses `VhdxMetadata` internally. All existing VHDx tests pass. |
| 4.3.2 Create VhdxDiskAsync + VhdxAsync | TBD | TBD | 4.3.1 | `VhdxDiskAsync`: device-level handle, owns `Arc<VhdxMetadata>`. Implements `DiskFile` + `AsyncDiskFile`. `VhdxAsync`: per-queue I/O worker using `IoUringContext`. Reads: `map_block_for_read` then io_uring `Readv`. Writes: `map_block_for_write` then io_uring `Writev`. Multi-block spanning handled like 4.1.3/4.2.1 with `PendingIo` tracking. |
| 4.3.3 Wire factory for VHDx async | TBD | TBD | 4.3.2 | Update factory.rs to return `VhdxDiskAsync` instead of `VhdxDiskSync` when io_uring is available and the caller requests async. Mirror the RAW/QCOW2/VHD factory selection logic. |
| 4.3.4 VHDx async tests | TBD | TBD | 4.3.2, 4.3.3 | Async read, async write, concurrent multi-queue access, `logical_size` and `physical_size` via `&self`, spanning read/write across block boundaries, factory selection (async when io_uring is available, sync otherwise), error propagation with `BlockError` context. |

Acceptance criteria for 4.3:

VhdxDiskAsync implements AsyncDiskFile. The VhdxAsync worker uses
IoUringContext from io/io_uring.rs. The factory returns VhdxDiskAsync
when io_uring is available and the format is VHDx. All existing VHDx
tests pass unchanged. No &mut self on read-path methods.

### 4.4 Cleanup

| Task | Assignee | PR | Depends on | Notes |
| --- | --- | --- | --- | --- |
| 4.4.1 Remove deprecated type aliases and re-exports | TBD | TBD | 4.1 to 4.3 | Remove all `#[deprecated]` aliases introduced in Cycle 2 task 3.4.1. `grep -r "#\[deprecated\]" block/src/` should return nothing. Update any remaining consumers in virtio-devices/, vmm/, and test crates to use canonical names. |
| 4.4.2 Remove sync fallback code paths in QcowAsync | TBD | TBD | 4.1, 4.2 | After async writes (4.1) and async multi-mapping reads (4.2), audit `QcowAsync` for remaining sync fallback paths. Remove dead code. If edge cases remain (e.g., QCOW2 backing-chain reads), document why the fallback is required. |
| 4.4.3 Audit public API surface | TBD | TBD | 4.4.1 | Review block/src/lib.rs re-exports. Only intentionally public items exported. Run `cargo doc` to verify no format-internal types leak beyond formats/*/internal/. |

Acceptance criteria for 4.4:

No #[deprecated] attributes in block/. QcowAsync sync fallback
code is either removed or documented with justification.
cargo doc --document-private-items shows clean module boundaries.

### 4.5 Benchmarks & Documentation

| Task | Assignee | PR | Depends on | Notes |
| --- | --- | --- | --- | --- |
| 4.5.1 Async write benchmarks | TBD | TBD | 4.1 | fio benchmarks: random write 4K, sequential write 128K, mixed read/write 70/30. Compare QCOW2 sync writes vs async writes. Compare against the RAW async baseline. Measure fsync latency (the primary bottleneck identified in the RFC: 41.77 ms QCOW2 vs 0.75 ms host). |
| 4.5.2 VHDx async benchmarks | TBD | TBD | 4.3 | Same fio workload as 4.5.1 for VHDx sync vs async. Baseline comparison. |
| 4.5.3 End-to-end integration tests | TBD | TBD | 4.1 to 4.3 | Full VM boot with QCOW2 async (reads + writes). Full VM boot with VHDx async. Multi-queue concurrent I/O. Extends Cycle 1 task 2.5 and Cycle 2 task 3.2.7 scope. |
| 4.5.4 Module documentation | TBD | TBD | 4.4 | Add `//!` doc comments to every module in block/src/. Cover: purpose, key types, usage examples, relationship to other modules. Focus on formats/, io/, factory.rs, disk_file.rs, error.rs. |
| 4.5.5 Architecture decision records | TBD | TBD | 4.4 | Document key design decisions: why metadata resolves synchronously under a lock rather than async; why sync decompression (Option A); why RwLock for L1/BAT plus Mutex for LRU caches; the IoUringContext shared-infrastructure rationale; the PendingIo tracker design. |
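As a concrete starting point, the 4.5.1 workloads could be expressed as a
single fio job file. This is only a sketch: the target path, runtime, and
queue depths are placeholders; only the workload shapes (4K random write,
128K sequential write, 70/30 mixed read/write) come from the task.

```ini
; Sketch of a fio job file for the 4.5.1 workloads.
; filename, runtime, and iodepth values are placeholders.
[global]
ioengine=libaio
direct=1
time_based=1
runtime=60
; Target: a guest block device backed by the image under test.
filename=/dev/vdb

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=32

[seqwrite-128k]
; stonewall serializes the jobs so results are not mixed.
stonewall
rw=write
bs=128k
iodepth=8

[mixed-rw-70-30]
stonewall
rw=randrw
rwmixread=70
bs=4k
iodepth=32
```

Running the same job file against QCOW2 sync, QCOW2 async, and the RAW
async baseline keeps the comparison apples-to-apples.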

Acceptance criteria for 4.5:

fio benchmarks show async write improvement over sync baseline.
fdatasync latency measurably improved with async fsync path. Integration
tests cover full VM lifecycle with async QCOW2 and VHDx. Every block/
module has //! documentation. Architecture decision records in docs/.

## Success Criteria

| Criterion | Verification |
| --- | --- |
| QCOW2 writes go through io_uring | `cow_write_sync` removed or only called for documented edge cases |
| Multi-cluster writes tracked correctly | Unit tests: spanning write, partial failure, completion ordering |
| Compressed reads handled without full sync fallback | Compressed regions resolved inline or via async I/O depending on the chosen option |
| Async fsync for data file | `grep -E "IORING_OP_FSYNC\|opcode::Fsync" block/src/formats/qcow/` finds the submission |
| VHDx has async support | `VhdxDiskAsync` implements `AsyncDiskFile`; factory returns it when io_uring is available |
| No deprecated aliases remain | `grep -r "#\[deprecated\]" block/src/` returns nothing |
| Async write performance improvement | fio random write 4K: async QCOW2 closer to the RAW async baseline than sync QCOW2 |
| fsync latency improvement | fio fsync: measurable reduction from the 41.77 ms sync baseline |
| All existing tests pass | CI green; no test modifications except additions |
| Documentation complete | Every block/ module has `//!` docs; ADRs in docs/ |
