GitHub Issue draft: reduce raw_block bringup time (parallelize slot-header validation)

Draft for review. User confirmation is required before filing. Target: upstream LMCache/LMCache, blank issue template. Shared goal: reduce restart/recovery bringup time. Supporting note: docs/notes/raw_block_recovery_validation_parallel.md. Merged consensus proposal: recovery_validation_merge_proposal.md

Title: [Perf][RawBlock] Reduce restart bringup time by parallelizing recovery slot-header validation

Label: performance / raw_block (to match the LMCache Onboarding category; choose the appropriate labels when filing)

Summary

Restart/recovery bringup of the raw_block backend is slowed by validating recovered slot headers one at a time. Parallelize the header reads (per I/O engine) to cut bringup time on large devices with many indexed entries.

Details

On restart, RawBlockCore loads the latest checkpoint, rebuilds its in-memory index, and then validates every recovered entry by reading that slot's on-device header and checking its identity / payload length against the index metadata (_validate_loaded_entries).

Today this validation issues one small synchronous read (a single slot header, e.g. 4 KiB) per entry, one slot at a time:

for encoded_key, entry in items:
    slot_hdr = self._read_slot_header(int(entry.offset))   # one sync read per slot
    ...

The fix is to issue these header reads in parallel using the mechanism native to each configured I/O engine, while keeping the validation logic and the on-device format unchanged:

POSIX engine: read headers with a bounded internal pool of reader threads (entries split into contiguous ranges), exposing device parallelism without relying on high thread counts.
io_uring engine: read all slot headers for a batch with a single batched_read + wait_iouring submission instead of a per-slot read loop. This covers both regular io_uring and NVMe passthrough (use_uring_cmd), since batched_read submits io_uring_cmd batches in passthrough mode.
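
As an illustration of the POSIX path only, the following is a minimal sketch of reading headers over contiguous ranges on a bounded thread pool. It is not the raw_block implementation: HEADER_SIZE, read_headers_range, and read_headers_parallel are hypothetical names, and the sketch assumes plain os.pread on a file descriptor rather than the engine's own read path. The io_uring path is not shown.

```python
import os
from concurrent.futures import ThreadPoolExecutor

HEADER_SIZE = 4096  # assumption: the issue cites 4 KiB only as an example


def read_headers_range(fd, offsets):
    """Read one header per offset; runs inside a single worker thread."""
    return [os.pread(fd, HEADER_SIZE, off) for off in offsets]


def read_headers_parallel(fd, offsets, max_workers=4):
    """Split offsets into contiguous ranges and read them on a bounded pool."""
    if not offsets:
        return []
    workers = max(1, min(max_workers, len(offsets)))
    chunk = -(-len(offsets) // workers)  # ceiling division
    ranges = [offsets[i:i + chunk] for i in range(0, len(offsets), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(read_headers_range, [fd] * len(ranges), ranges)
        # pool.map preserves input order, so headers line up with offsets.
        return [hdr for part in parts for hdr in part]
```

Because os.pread takes an explicit offset and does not move the file position, the worker threads can share one descriptor without locking. The pool size is the knob a benchmark would vary.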

Only how the headers are read changes — the engine-specific part is confined to header reading, and a single validation loop (identity / payload-length match, drop-on-mismatch) is shared across engines. The on-device layout is untouched.
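
The split between an engine-specific reader and a shared validation loop could look roughly like the sketch below. All types and names here (SlotHeader, IndexEntry, HeaderReader, validate_entries) are hypothetical stand-ins, not the actual RawBlockCore structures or the real _validate_loaded_entries signature; the sketch only assumes that a header carries an identity and a payload length.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class SlotHeader:  # hypothetical parsed on-device header
    key: str
    payload_len: int


@dataclass(frozen=True)
class IndexEntry:  # hypothetical in-memory index metadata
    offset: int
    payload_len: int


# Engine-specific part: maps offsets to parsed headers (None if unreadable).
HeaderReader = Callable[[Sequence[int]], List[Optional[SlotHeader]]]


def validate_entries(
    index: Dict[str, IndexEntry], read_headers: HeaderReader
) -> Dict[str, IndexEntry]:
    """Shared loop: keep entries whose header matches, drop the rest."""
    items = list(index.items())
    headers = read_headers([entry.offset for _, entry in items])
    valid: Dict[str, IndexEntry] = {}
    for (key, entry), hdr in zip(items, headers):
        if hdr is None or hdr.key != key or hdr.payload_len != entry.payload_len:
            continue  # drop on mismatch
        valid[key] = entry
    return valid
```

With this shape, a POSIX thread-pool reader and an io_uring batched reader would each supply a read_headers callable, and the comparison rules live in one place.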

Steps / Reproduction (if applicable)

Reproducible with a synthetic bringup benchmark that prepares many indexed entries (slot headers only) and measures bringup = checkpoint-load + header-validation time:

Prepare a checkpoint with a large number of indexed entries on the target device.
Bring up RawBlockCore and time the recovery (checkpoint load + header validation).
Compare the serial path against the parallel path (and varying reader parallelism) — the gap grows with entry count.
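
A rough sketch of the header-read portion of such a benchmark is shown below. It is not the shared bringup benchmark itself: it omits checkpoint loading, uses a temporary file instead of a raw block device, and every name and constant (HEADER_SIZE, SLOT_SIZE, prepare_device, run_benchmark) is a hypothetical choice for illustration.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

HEADER_SIZE = 4096      # assumption: example header size from the issue
SLOT_SIZE = 64 * 1024   # hypothetical slot stride


def prepare_device(path, n_slots):
    """Write only a header at the start of each slot; payloads stay sparse."""
    with open(path, "wb") as f:
        for i in range(n_slots):
            f.seek(i * SLOT_SIZE)
            f.write(i.to_bytes(8, "little").ljust(HEADER_SIZE, b"\0"))


def read_serial(fd, offsets):
    return [os.pread(fd, HEADER_SIZE, off) for off in offsets]


def read_parallel(fd, offsets, workers):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda off: os.pread(fd, HEADER_SIZE, off), offsets))


def time_reads(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def run_benchmark(n_slots=256, workers=4):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "device.img")
        prepare_device(path, n_slots)
        offsets = [i * SLOT_SIZE for i in range(n_slots)]
        fd = os.open(path, os.O_RDONLY)
        try:
            t_serial, serial = time_reads(read_serial, fd, offsets)
            t_parallel, parallel = time_reads(read_parallel, fd, offsets, workers)
        finally:
            os.close(fd)
    assert serial == parallel  # both paths must return identical headers
    return {"entries": n_slots, "serial_s": t_serial, "parallel_s": t_parallel}
```

On a freshly written temporary file the reads are served from the page cache, so this sketch will not show a meaningful gap; a real measurement needs the target device with caching bypassed or cold, and should sweep both entry count and reader parallelism.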

Expected Outcome / Goal

Lower restart/recovery bringup latency on large raw-block devices, with the gap over the serial path widening as the indexed entry count grows. A reusable bringup benchmark quantifies and protects the improvement.

Actual Outcome (if applicable)

Header validation runs serially (one synchronous read per slot), so bringup latency grows linearly with the number of indexed entries and leaves device parallelism (threads for POSIX, NVMe queue depth for io_uring) unused.

Additional Context

Work in progress as two complementary PRs (POSIX threadpool path + io_uring batched_read path, the latter also covering NVMe passthrough) plus the shared bringup benchmark; links will be added here as they open.
cc @DongDongJu (raw_block CODEOWNER)