GitHub Issue draft: reducing raw_block bringup time (parallelizing slot-header validation)
Draft for review. User confirmation is required before filing. Target: upstream LMCache/LMCache, blank issue template. Shared goal: reduce restart/recovery bringup time. Supporting notes:
docs/notes/raw_block_recovery_validation_parallel.md. Consolidated agreed proposal: recovery_validation_merge_proposal.md
Title: [Perf][RawBlock] Reduce restart bringup time by parallelizing recovery slot-header validation
Label: performance / raw_block (to match the LMCache Onboarding category; pick the appropriate label when filing)
Summary
Restart/recovery bringup of the raw_block backend is slowed by validating
recovered slot headers one at a time. Parallelize the header reads (per I/O
engine) to cut bringup time on large devices with many indexed entries.
Details
On restart, RawBlockCore loads the latest checkpoint, rebuilds its in-memory
index, and then validates every recovered entry by reading that slot's on-device
header and checking its identity / payload length against the index metadata
(_validate_loaded_entries).
Today this validation reads one small header (e.g. 4 KiB) per entry synchronously, one slot at a time:
    for encoded_key, entry in items:
        slot_hdr = self._read_slot_header(int(entry.offset))  # one sync read per slot
        ...
The fix is to issue these header reads in parallel using the mechanism native to each configured I/O engine, while keeping the validation logic and the on-device format unchanged:
- POSIX engine: read headers with a bounded internal pool of reader threads (entries split into contiguous ranges), exposing device parallelism without relying on high thread counts.
- io_uring engine: read all slot headers for a batch with a single batched_read + wait_iouring submission instead of a per-slot read loop. This covers both regular io_uring and NVMe passthrough (use_uring_cmd), since batched_read submits io_uring_cmd batches in passthrough mode.
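The POSIX variant above could look roughly like the following. This is a minimal sketch under stated assumptions, not the LMCache implementation: `read_headers_parallel`, `_read_range`, `HEADER_SIZE`, and `num_readers` are hypothetical names, and an ordinary file descriptor stands in for the raw block device. The io_uring path is not shown.

```python
import os
from concurrent.futures import ThreadPoolExecutor

HEADER_SIZE = 4096  # assumed header size (the text above cites 4 KiB as an example)


def _read_range(fd, offsets):
    # One worker reads the headers for one contiguous range of entries.
    # os.pread takes an explicit offset, so threads can share a single fd.
    return [os.pread(fd, HEADER_SIZE, off) for off in offsets]


def read_headers_parallel(fd, offsets, num_readers=4):
    """Return one header buffer per offset, in the same order as `offsets`."""
    if not offsets:
        return []
    # Bound the pool: never more workers than entries, never fewer than one.
    num_readers = max(1, min(num_readers, len(offsets)))
    chunk = -(-len(offsets) // num_readers)  # ceiling division
    ranges = [offsets[i:i + chunk] for i in range(0, len(offsets), chunk)]
    with ThreadPoolExecutor(max_workers=num_readers) as pool:
        # map() yields results in submission order, so the flattened list
        # lines up with `offsets`.
        parts = pool.map(_read_range, [fd] * len(ranges), ranges)
        return [hdr for part in parts for hdr in part]
```

Splitting into contiguous ranges keeps the task count equal to the reader count, so per-task scheduling overhead stays small even for very large indexes.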
Only how the headers are read changes — the engine-specific part is confined to header reading, and a single validation loop (identity / payload-length match, drop-on-mismatch) is shared across engines. The on-device layout is untouched.
Steps / Reproduction (if applicable)
Reproducible with a synthetic bringup benchmark that prepares many indexed entries (slot headers only) and measures bringup = checkpoint-load + header-validation time:
- Prepare a checkpoint with a large number of indexed entries on the target device.
- Bring up RawBlockCore and time the recovery (checkpoint load + header validation).
- Compare the serial path against the parallel path (and varying reader parallelism); the gap grows with entry count.
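The header-read portion of such a benchmark might be structured as below. This is a sketch under assumptions, not the planned benchmark: all names (`make_device`, `run_benchmark`, `SLOT_SIZE`, and so on) are hypothetical, a sparse temporary file stands in for the device, and checkpoint loading is omitted. Reads served from the page cache will understate the gap, so a real run would target the actual device.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

HEADER_SIZE = 4096      # assumed header size
SLOT_SIZE = 64 * 1024   # hypothetical slot stride


def make_device(path, num_entries):
    """Write only the slot headers, leaving the payload regions as holes."""
    with open(path, "wb") as f:
        for i in range(num_entries):
            f.seek(i * SLOT_SIZE)
            f.write(i.to_bytes(8, "little").ljust(HEADER_SIZE, b"\0"))
    return [i * SLOT_SIZE for i in range(num_entries)]


def read_serial(fd, offsets):
    return [os.pread(fd, HEADER_SIZE, off) for off in offsets]


def read_parallel(fd, offsets, workers):
    # For brevity this submits one task per header rather than contiguous ranges.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda off: os.pread(fd, HEADER_SIZE, off), offsets))


def time_reads(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def run_benchmark(num_entries=2048, workers=(1, 4, 16)):
    """Return a dict mapping each variant name to elapsed seconds."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "device.img")
        offsets = make_device(path, num_entries)
        fd = os.open(path, os.O_RDONLY)
        try:
            timings = {"serial": time_reads(read_serial, fd, offsets)[0]}
            for w in workers:
                timings[f"parallel-{w}"] = time_reads(read_parallel, fd, offsets, w)[0]
        finally:
            os.close(fd)
    return timings
```

Calling `run_benchmark()` with increasing `num_entries` gives the serial and parallel timings side by side, which is the comparison the steps above describe.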
Expected Outcome / Goal
Lower restart/recovery bringup latency on large raw-block devices, with the gap over the serial path widening as the indexed entry count grows. A reusable bringup benchmark quantifies and protects the improvement.
Actual Outcome (if applicable)
Header validation runs serially (one synchronous read per slot), so bringup latency grows linearly with the number of indexed entries and leaves device parallelism (threads for POSIX, NVMe queue depth for io_uring) unused.
Additional Context
- Work in progress as two complementary PRs (POSIX threadpool path + io_uring batched_read path, the latter also covering NVMe passthrough) plus the shared bringup benchmark; links will be added here as they open.
- cc @DongDongJu (raw_block CODEOWNER)