raw_block io_uring I/O infrastructure

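[!tldr] Practical takeaway: #3274 routes all device I/O through a single dispatcher (_write_buffers/_read_buffers). It also adds a batched_write API that submits N requests to io_uring at once. However, the slot header's payload length (32B) differs from its O_DIRECT-padded length (4096B), so can_batch=False and multi-key writes are still submitted individually, key by key. The infrastructure is in place; the next step is P0 (put_many N-SQE batching), which puts it to use.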

1. Background: from direct calls to a dispatcher

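Previously, every I/O caller (slot write, multi-load, metadata verification, checkpoint R/W) called the Rust binding's pwrite_from_buffer/pread_into directly. With no single entry point, adding an io_uring branch would have scattered it across the whole codebase.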

Solution: introduce a single dispatcher.

2. `_write_buffers` / `_read_buffers` dispatcher

Callers (slot write / multi-load / metadata verification / checkpoint R/W)
        │
        ▼
  _write_buffers(offsets, buffers, payload_lens, total_lens)
  _read_buffers (offsets, buffers, payload_lens, total_lens)
        │
        ├─ io_engine != "io_uring"
        │     └─ per buffer: pwrite_from_buffer / pread_into   (POSIX)
        │
        ├─ io_engine == "io_uring" + uring_cmd mode
        │     └─ NVMe passthrough (split into MDTS-sized chunks)
        │
        └─ io_engine == "io_uring" (regular block mode)
              ├─ can_batch:  batched_write / batched_read  +  wait_iouring
              └─ else:       write_uring / read_uring       (synchronous, one at a time)

The arguments are parallel lists: offsets[i], buffers[i], payload_lens[i], and total_lens[i] together describe the i-th I/O.
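The routing in the diagram can be sketched as a pure Python function that only decides which path a request takes. This is an illustrative sketch, not the code from #3274: `choose_path`, its `uring_cmd` flag, and the returned labels are hypothetical names. Only the branch order and the `can_batch` predicate come from the diagram.

```python
from typing import Sequence


def choose_path(io_engine: str, uring_cmd: bool,
                payload_lens: Sequence[int],
                total_lens: Sequence[int]) -> str:
    """Return a label for the path the dispatcher would take (hypothetical helper)."""
    if len(payload_lens) != len(total_lens):
        raise ValueError("parallel lists must have the same length")
    if io_engine != "io_uring":
        return "posix"             # pwrite_from_buffer / pread_into per buffer
    if uring_cmd:
        return "nvme_passthrough"  # split into MDTS-sized chunks
    # Regular block mode: batch only if no buffer needs padding.
    can_batch = all(p == t for p, t in zip(payload_lens, total_lens))
    return "batched" if can_batch else "sync_per_buffer"
```

For example, `choose_path("io_uring", False, [32, 8192], [4096, 8192])` takes the synchronous per-buffer path, because the first entry's payload length and total length differ.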

3. `can_batch`: the key condition for batch submission

can_batch = all(payload_len == total_len for ...)

batched_write is used only when payload_len == total_len for every buffer. The reason: batched_write puts each (offset, buffer, len) into an SQE as-is, with no padding.

Why the slot header+payload pair cannot be batched

Buffer	payload_len	total_len	Match?
Header	32B	4096B (O_DIRECT round-up)	No
Payload	aligned	aligned	Yes

When a header is present, can_batch=False and the write falls back to two individual write_uring calls.
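The mismatch can be reproduced with a few lines of arithmetic. The sketch below assumes an 8192B payload purely for illustration; `round_up` and the constant names are hypothetical helpers, while the 32B header, the 4096B alignment, and the `can_batch` predicate are taken from the text.

```python
BLOCK_ALIGN = 4096  # O_DIRECT alignment from the table above
HEADER_LEN = 32     # slot header size


def round_up(n: int, align: int = BLOCK_ALIGN) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def can_batch(payload_lens, total_lens) -> bool:
    """True only if no buffer needs padding."""
    return all(p == t for p, t in zip(payload_lens, total_lens))


payload_len = 8192  # assumed example payload, already a multiple of 4096
payload_lens = [HEADER_LEN, payload_len]
total_lens = [round_up(HEADER_LEN), round_up(payload_len)]

print(total_lens)                           # [4096, 8192]
print(can_batch(payload_lens, total_lens))  # False
```

The payload entry matches on its own; the header entry alone is enough to make the whole pair fall off the batched path.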

Actual path per caller

Caller	List length	can_batch	Path
Slot write (write loop)	2 (header+payload)	False	`write_uring` × 2
Multi-load	1 (payload)	True	`batched_read([1])`
Metadata verification	1 (header)	True	`batched_read([1])`
Checkpoint read	1	True	`batched_read([1])`
Checkpoint write	2 (both multiples of block_align)	True	`batched_write([2])`

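→ This is the root cause of the multi-key write loop failing to fill the device queue: each key issues two synchronous writes instead of joining a batch. Fixing this is P0.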

4. Rust-side components

`IoUringWrapper`: kernel-compatible dual ring

Mode	SQE/CQE size	Purpose
Standard	64B / 16B	Regular read/write
Big	128B / 32B	NVMe `uring_cmd` (passthrough)

The Big ring (Big SQE/CQE ABI) is available only on kernel 5.19+. It is required for NVMe passthrough.
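A caller that wants to choose between the two ring modes could gate on the kernel version first. The following is a minimal sketch of such a check; `parse_kernel_version` and `supports_big_ring` are hypothetical helpers, not part of `IoUringWrapper`. A version comparison is only an approximation, since vendor kernels may backport or disable features.

```python
import re
from typing import Tuple


def parse_kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.19.0-46-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def supports_big_ring(release: str) -> bool:
    """True if the version is at least 5.19 (version check only, not a feature probe)."""
    return parse_kernel_version(release) >= (5, 19)
```

On a Linux host the release string can be obtained with `platform.release()` and passed to `supports_big_ring`.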

Batch submission API

fn batched_write(offsets: Vec<u64>, buffers: Vec<PyAny>, total_lens: Vec<usize>) -> u64
fn wait_iouring(batch_id: u64)

batched_write takes N (offset, buffer, len) entries and submits N SQEs with a single io_uring_submit. wait_iouring then blocks until all CQEs of that batch have been reaped.
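The calling convention (submit once, get a batch id, wait on that id) can be illustrated with an in-memory stand-in. `FakeRing` below is a hypothetical mock that performs no real I/O and is not the Rust binding. It only mirrors the two signatures above, and unlike the real `wait_iouring` it returns the number of completions so the example has something to print.

```python
from typing import Dict, List, Sequence, Tuple


class FakeRing:
    """In-memory stand-in for the Rust wrapper; no real I/O is performed."""

    def __init__(self) -> None:
        self.submit_calls = 0
        self._next_id = 1
        self._pending: Dict[int, List[Tuple[int, bytes, int]]] = {}

    def batched_write(self, offsets: Sequence[int], buffers: Sequence[bytes],
                      total_lens: Sequence[int]) -> int:
        if not (len(offsets) == len(buffers) == len(total_lens)):
            raise ValueError("parallel lists must have the same length")
        # One queued entry per (offset, buffer, len), then a single submit.
        sqes = list(zip(offsets, buffers, total_lens))
        self.submit_calls += 1
        batch_id = self._next_id
        self._next_id += 1
        self._pending[batch_id] = sqes
        return batch_id

    def wait_iouring(self, batch_id: int) -> int:
        """Pretend to reap every completion of the batch; return how many."""
        return len(self._pending.pop(batch_id))


ring = FakeRing()
bid = ring.batched_write([0, 4096], [b"\0" * 4096, b"\0" * 4096], [4096, 4096])
completed = ring.wait_iouring(bid)
print(ring.submit_calls, completed)  # 1 2
```

The point of the shape is that the submit count stays at one no matter how many entries the parallel lists carry.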

Fixed buffer registration

register_fixed_buffers_from_allocator(allocator) registers the page buffers up front (register_buffers) to enable zero-copy I/O. If the allocator does not expose the required method, it falls back to non-fixed mode.
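The fallback can be pictured as a capability check on the allocator. The sketch below is an assumption about the shape of that check, not the actual implementation: the method name `iter_page_buffers`, the `register` callback, and both allocator classes are made up for illustration.

```python
def register_or_fallback(allocator, register) -> str:
    """Return 'fixed' if buffers were registered, else 'non-fixed'."""
    # 'iter_page_buffers' is a made-up method name for this sketch.
    get_buffers = getattr(allocator, "iter_page_buffers", None)
    if not callable(get_buffers):
        return "non-fixed"          # allocator lacks the method: fall back
    register(list(get_buffers()))   # register all page buffers up front
    return "fixed"


class PagedAllocator:
    def iter_page_buffers(self):
        return [bytearray(4096), bytearray(4096)]


class PlainAllocator:
    pass


registered = []
print(register_or_fallback(PagedAllocator(), registered.extend))  # fixed
print(register_or_fallback(PlainAllocator(), registered.extend))  # non-fixed
print(len(registered))                                            # 2
```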

5. Status summary

Feature	Status
Single I/O dispatcher	✅
io_uring N-request batch API (`batched_write` + `wait_iouring`)	✅
Zero-copy fixed buffer registration	✅
NVMe passthrough (`uring_cmd`)	✅ (Big ring)
N-request batching for multi-key writes	❌ `can_batch=False` due to header alignment

Notes

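PR #3274 (ankit-sam): origin of this infrastructure
Kernel requirements: 5.19+ for the io_uring Big ring (Big SQE/CQE) used by NVMe passthrough; O_DIRECT requires alignment to multiples of 4096B
Follow-up: [[raw_block-put_many-최적화]], which uses this infrastructure to convert multi-key writes into N-SQE batches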
