Recovery Validation Measurement Guide (Real NVMe)

What this PR validates: io_uring (block) and POSIX. io_uring_cmd (passthrough) is excluded from this PR because of a passthrough read bug; see uring_cmd_recovery_followup.md.

#	Implementation	How it is verified
1	io_uring batched recovery (block, non-uring_cmd)	Benchmark §3 + correctness §4
2	Bounded batch splitting in units of queue_depth (P2)	Benchmark §3 (N > queue_depth) + unit tests §2
3	Per-slot fallback on batch I/O error	Unit tests §2
4	Parameterized benchmark `--io-engine`	§3 itself

0. Environment requirements

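Block device /dev/nvmeXn1 (a dedicated or empty namespace that is safe to overwrite. ⚠️ --prepare overwrites the device. Never use a boot or data disk.)
Kernel with io_uring support, a Python 3.12 venv, and a built Rust extension (the io_uring_cmd char device is out of scope for this PR; see the followup).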
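
Because --prepare is destructive, a quick preflight check before running it can help. The following is only a minimal sketch, not part of this PR: the helper names are hypothetical, and the mount check is a heuristic that only compares the source column of /proc/mounts-style text against the device and its `p<N>` partitions. It does not detect swap, LVM, or device-mapper users of the device.

```python
import os
import stat


def is_block_device(path: str) -> bool:
    """Return True if path exists and is a block device node."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        return False


def mounted_sources(mounts_text: str) -> set[str]:
    """Collect the first (source) column of /proc/mounts-style text."""
    sources = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            sources.add(fields[0])
    return sources


def looks_unmounted(device: str, mounts_text: str) -> bool:
    """Heuristic: no mounted source is the device or one of its partitions."""
    return not any(
        src == device or src.startswith(device + "p")
        for src in mounted_sources(mounts_text)
    )


if __name__ == "__main__":
    device = "/dev/nvme0n1"  # same path as the benchmark commands in this guide
    with open("/proc/mounts", encoding="utf-8") as f:
        mounts = f.read()
    print("block device:", is_block_device(device))
    print("looks unmounted:", looks_unmounted(device, mounts))
```

A passing check does not prove the namespace is empty; it only rules out the most obvious mistake of pointing the benchmark at a mounted disk.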

1. Fetch the branch and build

# (current PC) push to origin (fork)
git push origin perf/iouring-recovery-batched-read
# (measurement PC) fetch + checkout (= 87638cb)
git fetch origin && git checkout perf/iouring-recovery-batched-read
uv venv --python 3.12 && source .venv/bin/activate
uv pip install torch && uv pip install -e . --no-build-isolation
uv pip install -r requirements/test.txt

The Rust module lmcache_rust_raw_block_io must be built (otherwise the unit tests are skipped via importorskip and the benchmark fails with an import error).

2. Unit tests (no device required)

pytest -xvs tests/v1/storage_backend/test_raw_block_core.py   # expected: 14 passed

Key items:

test_read_slot_headers_batched_splits_into_queue_depth_batches: queue_depth splitting + order preservation (P2)
test_read_slot_headers_batched_falls_back_to_per_slot_on_error: fallback isolation (on error, only the one failing slot becomes None)
test_validate_loaded_entries_iouring_uses_batched_read / _uring_cmd_uses_sequential: dispatch
test_raw_block_core_drops_checkpoint_entry_with_stale_slot_header: only the one entry with a stale header is dropped
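
The behaviour these tests target (queue_depth-bounded batches, order preservation, and per-slot fallback) can be sketched as follows. This is an illustrative model only, not the code under test: the function name and the `read_batch` / `read_one` callables are hypothetical stand-ins, and the sketch assumes that a failed batch surfaces as an `OSError`.

```python
from typing import Callable, Optional, Sequence


def read_headers_in_bounded_batches(
    slots: Sequence[int],
    queue_depth: int,
    read_batch: Callable[[Sequence[int]], list[bytes]],  # hypothetical batched reader
    read_one: Callable[[int], bytes],                     # hypothetical per-slot reader
) -> list[Optional[bytes]]:
    """Read headers in batches of at most queue_depth, preserving slot order.

    If a batch raises OSError, every slot of that batch is retried on its own,
    so a single bad slot produces a single None instead of failing the batch.
    """
    if queue_depth <= 0:
        raise ValueError("queue_depth must be positive")

    results: list[Optional[bytes]] = []
    for start in range(0, len(slots), queue_depth):
        batch = slots[start:start + queue_depth]
        try:
            results.extend(read_batch(batch))
        except OSError:
            # Per-slot fallback: isolate the failing slot(s) within this batch.
            for slot in batch:
                try:
                    results.append(read_one(slot))
                except OSError:
                    results.append(None)
    return results
```

With ten slots and queue_depth=4, this issues batches of 4, 4, and 2; if exactly one slot is unreadable, only that slot's entry in the result is None.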

3. Benchmark measurement (performance + correctness)

# Prepare the fixture (block device). N = (cache-space - meta) / slot-bytes
sudo python benchmarks/storage_backend_io/raw_block_recovery_bringup_bench.py \
  --device-path /dev/nvme0n1 --cache-space-gb 50 --slot-bytes 1048576 \
  --prepare --i-understand-this-overwrites-device
# Measure: posix (threads sweep) vs io_uring (block, batched)
sudo python benchmarks/storage_backend_io/raw_block_recovery_bringup_bench.py \
  --device-path /dev/nvme0n1 --cache-space-gb 50 --slot-bytes 1048576 \
  --measure --io-engine posix io_uring --threads 1 8 --repeats 5

posix runs with --threads (1, 8), i.e. serial vs threadpool; io_uring runs as a single block batched path.
use_odirect defaults to True, so both posix and io_uring bypass the OS page cache (real device latency).
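
The fixture size formula from the prepare step, N = (cache-space - meta) / slot-bytes, can be worked through as below. This is a sketch under stated assumptions: the metadata size is not given in this guide, so the 16 MiB used here is purely hypothetical, and `--cache-space-gb 50` is assumed to mean 50 GiB with the division rounding down.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def expected_slot_count(cache_space_bytes: int, meta_bytes: int, slot_bytes: int) -> int:
    """N = (cache-space - meta) / slot-bytes, rounded down to whole slots."""
    if slot_bytes <= 0:
        raise ValueError("slot_bytes must be positive")
    usable = cache_space_bytes - meta_bytes
    if usable < 0:
        raise ValueError("metadata does not fit in the cache space")
    return usable // slot_bytes


# Hypothetical inputs: 50 GiB cache space, 16 MiB metadata (assumed), 1 MiB slots.
n = expected_slot_count(50 * GIB, 16 * MIB, 1048576)
print(n)  # 51184 under the assumptions above
```

The value that matters for the correctness check in §4 is the N actually reported by --prepare, not this estimate.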

4. Interpreting results

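Correctness (required): indexed= for every label equals the N that was prepared.
Performance: on a real NVMe with large N, io_uring (block) should show speedup > 1 over serial. ⚠️ tmpfs or a small N is meaningless because device latency ≈ 0; always measure on a real NVMe with large N.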
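
The correctness criterion above amounts to comparing every label's indexed= value against the prepared N. A minimal sketch follows; the label strings and the dictionary shape are hypothetical, since the benchmark's exact output format is not reproduced in this guide.

```python
def labels_with_wrong_count(indexed_by_label: dict[str, int], expected_n: int) -> list[str]:
    """Return the labels whose indexed count differs from the prepared N."""
    return sorted(
        label for label, indexed in indexed_by_label.items() if indexed != expected_n
    )


# Hypothetical labels; the counts mirror the kind of values discussed in this guide.
observed = {
    "posix t=1": 524272,
    "posix t=8": 524272,
    "io_uring": 524272,
}
bad = labels_with_wrong_count(observed, expected_n=524272)
print(bad)  # [] means every label indexed all prepared entries
```

A path that returns indexed=0, as reported for io_uring_cmd, would show up in the returned list and fail the check.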

Measured results (2026-06-23, 524,272 entries, /dev/nvme6n1)

Path	median	speedup
posix serial (t=1)	70.7s	1×
posix threadpool (t=8)	16.3s	4.35×
io_uring batched (block)	11.6s	6.08×

→ io_uring (block) batched is 1.4× faster than the threadpool and 6.08× faster than serial. Validation complete. (io_uring_cmd returns indexed=0/EINVAL; see the followup.)
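
The speedup column is the serial median divided by each path's median. The sketch below shows that calculation; the function name and labels are hypothetical, and because only the rounded medians from the table are available here, each path is given a single sample. Recomputing from those rounded values gives roughly 4.34× and 6.09×, which differs from the table in the last digit, presumably because the table was computed from unrounded medians.

```python
from statistics import median


def speedups_vs_baseline(samples: dict[str, list[float]], baseline: str) -> dict[str, float]:
    """Median time per label, expressed as baseline_median / label_median."""
    medians = {label: median(times) for label, times in samples.items()}
    base = medians[baseline]
    return {label: base / m for label, m in medians.items()}


# Rounded medians (seconds) from the table above, one sample per path.
table = {
    "posix serial (t=1)": [70.7],
    "posix threadpool (t=8)": [16.3],
    "io_uring batched (block)": [11.6],
}
for label, ratio in speedups_vs_baseline(table, "posix serial (t=1)").items():
    print(f"{label}: {ratio:.2f}x")
```

With the real benchmark, each list would hold the --repeats samples for that path rather than a single rounded value.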

5. Caveats

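⚠️ --prepare overwrites the device metadata and headers. Use only an empty or dedicated namespace.
The fallback (bad-sector isolation) is verified by the §2 unit tests because I/O errors are hard to inject artificially on a real device. The "only one entry is dropped" behavior of the stale-header case (_drops_..._stale_slot_header) can also be confirmed on a real device (validation mismatch path).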
