raw_block: Parallelizing Recovery Header Validation

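[!tldr] Practical takeaway: on restart, raw_block validated N slot headers with serial pread calls (N × 100 µs). That bottleneck is removed by submitting the reads through io_uring batched_read. On a real NVMe device (524K entries), serial validation took 70.7 s; io_uring is 6.08× faster and the POSIX thread pool is 4.35× faster. For an HC-SSD (15 TB, ~3750 slots) the theoretical reduction is ~94×. The work was split between DG (POSIX thread pool + interface skeleton) and NY (io_uring implementation); integration is complete and the upstream PR is pending.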

1. Problem

When RawBlockCore restarts, it loads the latest checkpoint to restore the in-memory index, then reads the slot header for every index entry and verifies that identity/payload_len match the metadata (_validate_loaded_entries).

Previous implementation:

for encoded_key, entry in items:
    slot_hdr = self._read_slot_header(int(entry.offset))   # one synchronous pread per slot

→ With N entries, this issues N small 4 KiB synchronous reads back to back, leaving the NVMe queue depth entirely unused.

Scale	serial	batched_read
Estimate: HC-SSD 15 TB, 4 MB slots, ~3750 slots	~375 ms	~4 ms
Measured: 524K entries, /dev/nvme6n1	70.7 s	11.6 s (6.08×)
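
The serial estimate follows directly from the N × 100 µs model. A minimal sketch of that arithmetic is below; the helper names are hypothetical, and the 100 µs per-read latency is the figure quoted in the summary, not a measured constant.

```python
def serial_estimate_ms(num_slots: int, per_read_us: int = 100) -> float:
    """Estimated time for num_slots back-to-back synchronous reads, in ms."""
    return num_slots * per_read_us / 1000


def per_entry_us(total_seconds: float, entries: int) -> float:
    """Average observed cost per entry, in microseconds."""
    return total_seconds * 1_000_000 / entries


hc_ssd_ms = serial_estimate_ms(3750)        # slot count from the estimate row
measured_us = per_entry_us(70.7, 524_272)   # measured serial run
```

The measured serial run works out to roughly 135 µs per entry, which is in the same range as the 100 µs assumption behind the estimate.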

2. Work split across the two branches

Area	Owner	Mechanism
dispatch interface + serial fallback	DG (daegyu94)	new `_read_slot_headers` entry point, `_is_stale_header` factored out
POSIX parallel read	DG	`ThreadPoolExecutor` (8 readers), parallel pread over split ranges
io_uring batched read	NY	single `batched_read` submission + `wait_iouring`
Shared bringup benchmark	DG	`raw_block_recovery_bringup_bench.py`
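
The POSIX path is not shown in this note, so the following is only a sketch of the range-split idea, not the code in the DG branch. The names `read_header` and `read_headers_parallel`, the 4096-byte header size, and the convention that `None` marks a failed read are all assumptions for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

HEADER_BYTES = 4096  # assumed header read size


def read_header(fd: int, offset: int) -> Optional[bytes]:
    """Read one header; None marks a failed or short read."""
    try:
        data = os.pread(fd, HEADER_BYTES, offset)
    except OSError:
        return None
    return data if len(data) == HEADER_BYTES else None


def read_headers_parallel(
    fd: int, offsets: list[int], threads: int = 8
) -> list[Optional[bytes]]:
    """Split offsets into contiguous ranges and pread each range on its own worker."""
    if not offsets:
        return []
    threads = max(1, min(threads, len(offsets)))
    step = -(-len(offsets) // threads)  # ceiling division
    ranges = [offsets[i:i + step] for i in range(0, len(offsets), step)]

    def read_range(chunk: list[int]) -> list[Optional[bytes]]:
        return [read_header(fd, off) for off in chunk]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with offsets.
        parts = list(pool.map(read_range, ranges))
    return [hdr for part in parts for hdr in part]
```

Because os.pread releases the GIL while it blocks, the workers can keep several reads in flight at once, which is what lets the device see a queue depth above 1.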

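3. Unified interface (agreed design)

Core principle: the validation logic (compare identity/payload_len, then drop) is identical regardless of engine. Engine-specific branching lives in exactly one place: the read.

read (per-engine branch)      →  decide (engine-agnostic)  →  apply (drop)
_read_slot_headers(offsets)      _is_stale_header             _validate_loaded_entries body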

def _read_slot_headers(self, offsets: list[int]) -> list[Optional[tuple[int, int]]]:
    n = len(offsets)
    if self.io_engine == "io_uring" and not self.use_uring_cmd and n > 1:
        return self._read_slot_headers_batched(offsets)         # NY
    if self.io_engine == "posix" and self._recovery_read_threads > 1 and n > 1:
        return self._read_slot_headers_posix_parallel(offsets)  # DG
    return [self._read_slot_header(off) for off in offsets]      # serial fallback

There is exactly one validation loop, with no engine branching:

to_drop = [
    key
    for (key, entry), hdr in zip(items, headers, strict=True)
    if self._is_stale_header(key, entry, hdr)
]

4. io_uring implementation details (NY)

4.1 Aligned buffer + single batch submission

def _read_slot_headers_batched(self, offsets: list[int]) -> list[Optional[tuple[int, int]]]:
    n = len(offsets)
    align = self.block_align
    hdr = self.header_bytes
    raw_buf = bytearray(n * hdr + align - 1)
    addr = ctypes.addressof(ctypes.c_byte.from_buffer(raw_buf))
    pad = (-addr) % align
    views = [memoryview(raw_buf)[pad + i*hdr : pad + (i+1)*hdr] for i in range(n)]

    with self._lock:
        self._inflight_io_count += 1
    try:
        raw_dev = self._rawdev()
        batch_id = raw_dev.batched_read(offsets, views, [hdr] * n)
        raw_dev.wait_iouring(batch_id)
    except Exception:
        return [self._read_slot_header(off) for off in offsets]  # per-slot fallback
    finally:
        with self._lock:
            self._inflight_io_count -= 1
            self._last_io_ts = time.monotonic()

    return [self._decode_slot_header(bytes(v)) for v in views]
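
The alignment trick in the function above can be checked in isolation. The sketch below reproduces only the buffer carving, with hypothetical helper names; it assumes the per-header size is a multiple of the alignment, since otherwise only the first view would start on an aligned address.

```python
import ctypes


def aligned_views(n: int, size: int, align: int) -> tuple[bytearray, list[memoryview]]:
    """Carve n size-byte views out of one bytearray, starting at an aligned address."""
    raw = bytearray(n * size + align - 1)  # over-allocate by align - 1
    addr = ctypes.addressof(ctypes.c_byte.from_buffer(raw))
    pad = (-addr) % align                  # bytes to skip to reach alignment
    base = memoryview(raw)
    views = [base[pad + i * size : pad + (i + 1) * size] for i in range(n)]
    return raw, views


def view_address(view: memoryview) -> int:
    """Address of the first byte of a writable memoryview."""
    return ctypes.addressof(ctypes.c_byte.from_buffer(view))


raw, views = aligned_views(n=4, size=4096, align=4096)
all_aligned = all(view_address(v) % 4096 == 0 for v in views)
```

Over-allocating by `align - 1` bytes guarantees that an aligned start address exists inside the buffer regardless of where the allocator placed it.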

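4.2 Batch I/O error isolation (per-slot fallback)

If any single read in a batch fails, wait_iouring fails the whole batch with a PyOSError, so the caller cannot tell which slot failed. With the earlier `except: return [None]*n`, one bad sector in one slot caused every valid entry to be dropped, which is a regression.

→ Fix: except Exception: return [self._read_slot_header(off) for off in offsets]. Only the bad slot becomes None; the rest are restored.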
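
The effect of the fix can be illustrated with a stand-in device. `FakeDevice` below is purely hypothetical and only mimics the all-or-nothing batch failure described above; it is not the real rawdev API.

```python
from typing import Optional


class FakeDevice:
    """Hypothetical stand-in: a batch read fails as a whole if any offset is bad."""

    def __init__(self, bad_offsets: set[int]) -> None:
        self.bad_offsets = bad_offsets

    def read_one(self, offset: int) -> Optional[bytes]:
        if offset in self.bad_offsets:
            return None
        return offset.to_bytes(8, "little")

    def read_batch(self, offsets: list[int]) -> list[bytes]:
        if any(off in self.bad_offsets for off in offsets):
            raise OSError("batch failed")  # no indication of which slot
        return [off.to_bytes(8, "little") for off in offsets]


def read_headers(dev: FakeDevice, offsets: list[int]) -> list[Optional[bytes]]:
    try:
        return list(dev.read_batch(offsets))
    except OSError:
        # Re-read slot by slot so only the failing slot yields None.
        return [dev.read_one(off) for off in offsets]


result = read_headers(FakeDevice({4096}), [0, 4096, 8192])
```

With one bad offset out of three, only the middle result is `None`; returning `[None] * n` from the handler would have discarded the two healthy slots as well.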

4.3 Bounded batches split by queue_depth

Sending all N entries as one batch would require a single 4 GiB allocation for 1M entries × 4 KiB. Splitting with batch_size = max(1, iouring_queue_depth) (= 256) bounds each batch buffer to ~1 MiB.
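
A minimal sketch of the splitting step, assuming a 4096-byte header and a queue depth of 256. `split_batches` and the 4 MiB slot stride are illustrative names and values, not taken from the branch.

```python
from typing import Iterator


def split_batches(offsets: list[int], queue_depth: int) -> Iterator[list[int]]:
    """Yield consecutive chunks of at most max(1, queue_depth) offsets."""
    batch_size = max(1, queue_depth)
    for start in range(0, len(offsets), batch_size):
        yield offsets[start:start + batch_size]


HEADER_BYTES = 4096                                    # assumed header size
offsets = [i * 4 * 1024 * 1024 for i in range(1000)]   # hypothetical 4 MiB slot stride
batches = list(split_batches(offsets, queue_depth=256))
peak_buffer = max(len(b) for b in batches) * HEADER_BYTES
```

The peak buffer is set by the batch size rather than by N, so memory stays at about 1 MiB no matter how many entries the index holds.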

5. Real NVMe validation results (2026-06-23, 524,272 entries, /dev/nvme6n1)

Path	median	speedup
posix serial (t=1)	70.7s	1×
posix threadpool (t=8)	16.3s	4.35×
io_uring batched (block)	11.6s	6.08×

14 unit tests pass (dispatch correctness, queue_depth splitting, fallback isolation, stale-header drop)
pre-commit passes

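6. uring_cmd (NVMe passthrough): excluded from this PR

In the same measurement run, the io_uring_cmd path (char device /dev/ng6n1) ended with indexed=0 (the checkpoint meta payload read failed with EINVAL).

Cause: _load_meta_payload allocates an unaligned buffer of exactly bytearray(total_len) bytes. uring_cmd passthrough requires PRP page alignment for multi-page transfers, so the unaligned 2 MiB (512-page) buffer violates PRP and the read returns EINVAL. The header (a single page) passes, which is why the problem stayed hidden until now.

The existing fix branch priv/sy/fix/raw-block-uring-cmd-aligned-buffers covers only the padding path, not the recovery path.

Fix direction: in Rust read_uring, always bounce unaligned pointers through an AlignedBuf. This goes into a separate Rust issue/PR (@DongDongJu is CODEOWNER).

To re-enable: remove the not self.use_uring_cmd condition from the dispatch (one line), after which uring_cmd also takes the batched path.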

7. PR / merge order

DG PR: _read_slot_headers dispatch + _is_stale_header + serial + POSIX thread pool + benchmark. ✅ Implemented (origin/priv/dg/raw-block-multithreads-recovery, top c85cb720).
NY PR: rebased on top of DG; fills in the io_uring branch with _read_slot_headers_batched. ✅ Implemented (perf/iouring-recovery-batched-read, commit 87638cb).
Remaining: submit the DG skeleton PR upstream → stack NY on top of it → rebase onto dev → file the umbrella issue + link the PRs.

8. GitHub issue draft

Title: [Perf][RawBlock] Reduce restart bringup time by parallelizing recovery slot-header validation

Summary: on restart, replace serial header validation with per-engine parallel reads (POSIX thread pool / io_uring batched_read) to cut bringup latency. cc @DongDongJu. To be covered by two PRs.
