PR #3274 Analysis: io_uring + NVMe io_uring_cmd Support

Written: 2026-05-26 / Last updated: 2026-06-04
Target PR: https://github.com/LMCache/LMCache/pull/3274
Author: Ankit Kumar (@ankit-sam)
Status: close to merge. DongDongJu APPROVED on 6/1; only ApostaC's 6/3 doc comments remain (REVIEW_REQUIRED)

ApostaC's 6/3 comments (doc consistency):

Update the benchmarks/storage_backend_io/ README (the --use-uring-cmd flag is not documented yet)

Update docs/source/mp/l2_storage.rst to cover the new config options use_uring_cmd and max_data_transfer_size

Once Ankit addresses these two items, the PR can be merged. The window for starting M1/M2 opens soon.

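1. Overview

This PR bundles two things.

Restoring the missing io_uring changes: re-applies the io_uring code that was dropped during the rebase for MP mode integration
New NVMe io_uring_cmd (passthrough) support: the application issues NVMe commands directly through io_uring, bypassing the kernel block layer (the commands are still handed to the device by the NVMe driver)

This PR is a prerequisite for M3 (io_uring setup flag tuning). The switch from IoUring::new() to IoUring::builder().build() happens in this PR, and M3 is the work of adding setup flags to that builder.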

2. Changed files

File	Changes
`rust/raw_block/src/lib.rs`	IoUringWrapper, NVMe structs, register_fixed_buffers
`lmcache/v1/storage_backend/raw_block/core.py`	`_write_buffers`, `_read_buffers`, fixed buffer registration
`lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py`	adds the `use_uring_cmd` and `max_data_transfer_size` settings
`lmcache/bench/storage_backend_io_benchmark.py`	adds the `--use-uring-cmd` flag to the benchmark CLI

3. Rust changes (`lib.rs`)

3.1 IoUringWrapper: kernel version compatibility abstraction

#[derive(Clone)]
enum IoUringWrapper {
    Standard(Arc<Mutex<IoUring<SqueueEntry, Entry>>>),  // kernels 5.4 to 5.18
    Big(Arc<Mutex<IoUring<Entry128, Entry32>>>),         // kernel 5.19+
}

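Why are two variants needed?

Type	SQE size	CQE size	Required kernel	Use
`Standard` (SqueueEntry/Entry)	64 bytes	16 bytes	5.4+	regular pread/pwrite
`Big` (Entry128/Entry32)	128 bytes	32 bytes	5.19+	`io_uring_cmd` (NVMe passthrough)

NVMe io_uring_cmd has to carry the NVMe command struct (80 bytes) inline in the SQE. A 64-byte SQE leaves only 16 bytes of inline command space, whereas a 128-byte SQE provides 80 bytes, so the 128-byte SQE is required.

Initialization logic: try Big first, fall back to Standard on failure: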

let ring = match IoUring::<Entry128, Entry32>::builder()
    .build(iouring_queue_depth as u32)
{
    Ok(big_ring) => IoUringWrapper::Big(Arc::new(Mutex::new(big_ring))),
    Err(_) => {
        if use_uring_cmd {
            return Err(PyRuntimeError::new_err(
                "io_uring_cmd requires kernel 5.19 or later",
            ));
        }
        let std_ring = IoUring::<SqueueEntry, Entry>::builder()
            .build(iouring_queue_depth as u32)...;
        IoUringWrapper::Standard(Arc::new(Mutex::new(std_ring)))
    }
};

M3 starting point: the core of M3 is adding calls such as .setup_single_issuer().setup_defer_taskrun() before .build(). They must be applied to both paths (Big and Standard).

3.2 NvmeUringCmd: NVMe command struct

#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct NvmeUringCmd {
    opcode: u8,      // NVME_IO_READ(0x02) / NVME_IO_WRITE(0x01)
    flags: u8,
    rsvd1: u16,
    nsid: u32,       // NVMe Namespace ID
    cdw2: u32,
    cdw3: u32,
    metadata: u64,
    addr: u64,       // data buffer address
    metadata_len: u32,
    data_len: u32,   // transfer size (bytes)
    cdw10: u32,      // SLBA[31:0]: lower 32 bits of the starting LBA
    cdw11: u32,      // SLBA[63:32]: upper 32 bits of the starting LBA
    cdw12: u32,      // NLB (Number of Logical Blocks - 1) | dtype
    cdw13: u32,      // dspec (the FDP placement ID goes here)
    cdw14: u32,
    cdw15: u32,
    rsvd2: [u32; 4],
}

The dspec field in cdw13 is the link to FDP. Putting the FDP placement_id into this field selects the data placement stream at the NVMe level. This is where the PR connects to Han Dae-gyu's (한대규) FDP design.
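As a quick sanity check on the layout, the struct above can be mirrored with Python's `ctypes` to confirm that it occupies exactly 80 bytes and therefore fits the inline command area of a 128-byte SQE. The Python class is just a mirror of the Rust definition shown in this section; `SQE128_CMD_BYTES` is a name introduced here for illustration (128-byte SQE minus the 48 bytes that precede the command area).

```python
import ctypes


class NvmeUringCmd(ctypes.Structure):
    """ctypes mirror of the #[repr(C)] NvmeUringCmd struct shown above."""
    _fields_ = [
        ("opcode", ctypes.c_uint8),
        ("flags", ctypes.c_uint8),
        ("rsvd1", ctypes.c_uint16),
        ("nsid", ctypes.c_uint32),
        ("cdw2", ctypes.c_uint32),
        ("cdw3", ctypes.c_uint32),
        ("metadata", ctypes.c_uint64),
        ("addr", ctypes.c_uint64),
        ("metadata_len", ctypes.c_uint32),
        ("data_len", ctypes.c_uint32),
        ("cdw10", ctypes.c_uint32),
        ("cdw11", ctypes.c_uint32),
        ("cdw12", ctypes.c_uint32),
        ("cdw13", ctypes.c_uint32),
        ("cdw14", ctypes.c_uint32),
        ("cdw15", ctypes.c_uint32),
        ("rsvd2", ctypes.c_uint32 * 4),
    ]


# Inline command space of a 128-byte SQE (illustrative constant).
SQE128_CMD_BYTES = 128 - 48

cmd_size = ctypes.sizeof(NvmeUringCmd)
print(cmd_size, NvmeUringCmd.addr.offset, NvmeUringCmd.cdw13.offset)
```

With natural C alignment there is no padding, so the command fills the 80-byte area exactly; a regular 64-byte SQE has only 16 bytes there.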

3.3 nvme_uring_cmd_prep: building the NVMe command

fn nvme_uring_cmd_prep(
    cmd: &mut NvmeUringCmd,
    is_write: bool,
    nsid: u32,
    offset: u64,   // byte offset
    len: usize,
    lba_shift: u32,  // LBA size = 1 << lba_shift (typically 9 = 512B, 12 = 4KB)
    ptr: *const u8,
    dtype: u8,       // Directive Type (FDP = 2)
    dspec: u16,      // Directive Specific (FDP placement handle ID)
) -> Result<(), PyErr> {
    let slba = offset >> lba_shift;  // byte offset → LBA number
    let nlb = (len >> lba_shift) - 1;  // 0-based block count; assumes len is a non-zero multiple of the LBA size

    cmd.cdw10 = (slba & 0xFFFFFFFF) as u32;
    cmd.cdw11 = (slba >> 32) as u32;
    cmd.cdw12 = nlb as u32 | ((dtype as u32) << 20);
    cmd.cdw13 = (dspec as u32) << 16;
    cmd.addr = ptr as u64;
    cmd.data_len = len as u32;
    ...
}

LBA conversion flow:

byte offset   → SLBA (LBA number) = offset >> lba_shift
transfer size → NLB (block count - 1) = (len >> lba_shift) - 1
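A worked example of the conversion, written as a small Python helper. It follows the bit positions used in the Rust excerpt above (NLB in the low bits of cdw12, dtype shifted by 20, dspec shifted by 16 in cdw13). The function name `nvme_rw_dwords` and the explicit validation are assumptions added for illustration; the PR's own checks are not shown in the excerpt.

```python
from typing import Dict


def nvme_rw_dwords(offset: int, length: int, lba_shift: int,
                   dtype: int = 0, dspec: int = 0) -> Dict[str, int]:
    """Compute cdw10..cdw13 for a read/write from a byte offset and length."""
    block = 1 << lba_shift
    if offset < 0 or offset % block:
        raise ValueError("offset must be a non-negative multiple of the LBA size")
    if length <= 0 or length % block:
        raise ValueError("length must be a non-zero multiple of the LBA size")
    if not 0 <= dtype <= 0xF or not 0 <= dspec <= 0xFFFF:
        raise ValueError("dtype is a 4-bit field and dspec a 16-bit field")

    slba = offset >> lba_shift          # starting LBA
    nlb = (length >> lba_shift) - 1     # 0-based number of logical blocks
    if nlb > 0xFFFF:
        raise ValueError("NLB is a 16-bit field; split the transfer")

    return {
        "cdw10": slba & 0xFFFFFFFF,
        "cdw11": (slba >> 32) & 0xFFFFFFFF,
        "cdw12": nlb | (dtype << 20),
        "cdw13": dspec << 16,
    }


# 128 KiB write at byte offset 1 MiB on a 4 KiB-LBA namespace, FDP handle 3.
dwords = nvme_rw_dwords(offset=1 << 20, length=128 * 1024, lba_shift=12,
                        dtype=2, dspec=3)
print({k: hex(v) for k, v in dwords.items()})
```

Here SLBA is 256 (1 MiB / 4 KiB) and NLB is 31 (32 blocks, 0-based), so cdw12 is 0x20001f and cdw13 is 0x30000.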

3.4 register_fixed_buffers: registering buffers for zero-copy I/O

fn register_fixed_buffers(&self, buffer_ptrs: Vec<usize>, buffer_sizes: Vec<usize>) -> PyResult<()> {
    // build the ptr → (idx, size) map
    let mut map = self.fixed_buffer_map.lock().unwrap();
    for (idx, (ptr, size)) in buffer_ptrs.iter().zip(buffer_sizes.iter()).enumerate() {
        map.insert(*ptr, (idx as u16, *size));
    }

    // register the buffers with the kernel
    let iovecs: Vec<libc::iovec> = buffer_ptrs.iter().zip(buffer_sizes.iter())
        .map(|(ptr, size)| libc::iovec { iov_base: *ptr as *mut _, iov_len: *size })
        .collect();
    ring.submitter().register_buffers(&iovecs)  // registers N buffers with a single syscall
}

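Why does it matter? Registering buffers with the kernel up front lets each I/O refer to them by buf_index, so the kernel does not have to pin and map the pages again on every request. The memory (host buffers from the CPU allocator, see 4.2) is pinned once and then reused.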
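The ptr → (idx, size) bookkeeping can be sketched as follows. The map shape matches the Rust excerpt; the helper names and the lookup policy (return None so the caller can take a non-fixed path when the pointer is unknown or the request is larger than the registered buffer) are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Tuple

FixedMap = Dict[int, Tuple[int, int]]  # base pointer -> (buf_index, size)


def build_fixed_buffer_map(ptrs: List[int], sizes: List[int]) -> FixedMap:
    """Assign buf_index values in registration order, as the kernel does for iovecs."""
    if len(ptrs) != len(sizes):
        raise ValueError("ptrs and sizes must have the same length")
    if len(ptrs) > 0x10000:
        raise ValueError("buf_index is stored as u16")
    return {ptr: (idx, size) for idx, (ptr, size) in enumerate(zip(ptrs, sizes))}


def lookup_buf_index(fixed_map: FixedMap, ptr: int, length: int) -> Optional[int]:
    """Return the registered index for a buffer, or None if it cannot be used."""
    entry = fixed_map.get(ptr)
    if entry is None:
        return None
    idx, size = entry
    return idx if length <= size else None


fixed = build_fixed_buffer_map([0x1000, 0x9000], [0x8000, 0x4000])
print(lookup_buf_index(fixed, 0x9000, 0x4000))  # 1
print(lookup_buf_index(fixed, 0x5000, 0x1000))  # None (not a registered base pointer)
```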

3.5 Worker thread batch submit pattern

// take the queued I/O requests and submit them together
let mut batch: Vec<IoSubmission> = std::mem::take(&mut *q);
for sub in batch.iter().take(to_submit_count) {
    build_and_submit_sqe(&ring_clone, sub, user_data);  // pushes onto the SQ (not a syscall)
}
// hand the whole batch to the kernel at once
ring.submitter().submit()  // 1 syscall

Instead of calling submit() per request, requests are accumulated in the SQ and submitted as a batch with a single syscall. This is the core structure that brings the syscall count for N I/Os down to one.
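The effect of batching on syscall count can be shown with a toy model. `FakeRing` below is not io_uring; it is a stand-in that only counts pushes and submit() calls, and `drain_and_submit` is a hypothetical name for the worker loop body.

```python
from collections import deque


class FakeRing:
    """Toy submission queue: push() is a memory write, submit() models the syscall."""

    def __init__(self, sq_entries: int) -> None:
        self.sq_entries = sq_entries
        self.sq = []
        self.submit_calls = 0
        self.submitted = 0

    def push(self, sqe) -> None:
        if len(self.sq) >= self.sq_entries:
            raise BufferError("submission queue full")
        self.sq.append(sqe)

    def submit(self) -> int:
        self.submit_calls += 1
        n = len(self.sq)
        self.submitted += n
        self.sq.clear()
        return n


def drain_and_submit(queue: deque, ring: FakeRing) -> int:
    """Move as many queued requests as fit into the SQ, then submit once."""
    count = min(len(queue), ring.sq_entries - len(ring.sq))
    for _ in range(count):
        ring.push(queue.popleft())
    return ring.submit() if count else 0


pending = deque(range(10))
ring = FakeRing(sq_entries=8)
first = drain_and_submit(pending, ring)   # 8 requests, 1 submit
second = drain_and_submit(pending, ring)  # remaining 2 requests, 1 submit
print(first, second, ring.submit_calls)   # 8 2 2
```

Ten requests cost two submit calls here instead of ten; with a queue depth of at least ten it would be one.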

4. Python changes (`core.py`)

4.1 `_write_buffers` / `_read_buffers`: I/O path abstraction

io_engine = "posix"     → pwrite_from_buffer (existing synchronous I/O)
io_engine = "io_uring"
  └── use_uring_cmd=True  → _write_uring_cmd_buffers (NVMe passthrough)
  └── use_uring_cmd=False
        └── payload==total (aligned)  → batched_write + wait_iouring (batched async)
        └── otherwise                 → write_uring (individual async)

Key change: whereas the old _write_one issued two pwrite calls (one for the header, one for the payload), _write_buffers can submit multiple I/Os at once through batched_write on the io_uring path.
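The decision tree above reduces to a small dispatch function. The sketch below returns the name of the path rather than performing I/O; `select_write_path` and its `payload_len`/`total_len` parameters are illustrative names, not the signatures used in core.py.

```python
def select_write_path(io_engine: str, use_uring_cmd: bool,
                      payload_len: int, total_len: int) -> str:
    """Return which write routine the decision tree above would pick."""
    if io_engine == "posix":
        return "pwrite_from_buffer"
    if io_engine != "io_uring":
        raise ValueError(f"unknown io_engine: {io_engine!r}")
    if use_uring_cmd:
        return "_write_uring_cmd_buffers"
    if payload_len == total_len:
        # Payload already fills the aligned extent, so writes can be batched.
        return "batched_write + wait_iouring"
    return "write_uring"


print(select_write_path("io_uring", False, 4096, 4096))  # batched_write + wait_iouring
print(select_write_path("io_uring", False, 4000, 4096))  # write_uring
```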

4.2 `register_fixed_buffers_from_allocator`

def register_fixed_buffers_from_allocator(self, memory_allocator) -> None:
    buffers = memory_allocator.get_paged_buffers()
    buffer_ptrs = [buf.data_ptr() for buf in buffers]
    buffer_sizes = [buf.numel() * buf.element_size() for buf in buffers]
    self._rawdev().register_fixed_buffers(buffer_ptrs, buffer_sizes)

Registers the CPU allocator's paged buffers with io_uring. Subsequent I/O on those buffers is handled as zero-copy.

4.3 `max_hw_sectors_kb`: automatic transfer size splitting

max_hw_sectors_kb = _read_sysfs_int(f"{queue_dir}/max_hw_sectors_kb")
resolved_bytes = max_hw_sectors_kb * 1024
aligned_bytes = (resolved_bytes // self.block_align) * self.block_align

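The maximum transfer size the NVMe device can handle in one command is read from sysfs (max_hw_sectors_kb, in KiB), rounded down to a multiple of block_align, and used to split large I/Os automatically. A KV chunk that exceeds this size is issued as multiple NVMe commands.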
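A minimal sketch of the splitting step, assuming the limit has already been read from sysfs. `resolve_max_transfer` repeats the arithmetic from the excerpt above; `split_transfer` is a hypothetical helper showing how one large extent becomes several commands.

```python
from typing import List, Tuple


def resolve_max_transfer(max_hw_sectors_kb: int, block_align: int) -> int:
    """KiB limit from sysfs -> bytes, rounded down to a multiple of block_align."""
    aligned = (max_hw_sectors_kb * 1024 // block_align) * block_align
    if aligned <= 0:
        raise ValueError("device limit is smaller than one aligned block")
    return aligned


def split_transfer(offset: int, length: int, max_bytes: int) -> List[Tuple[int, int]]:
    """Split [offset, offset + length) into (offset, length) pieces of at most max_bytes."""
    chunks = []
    end = offset + length
    while offset < end:
        n = min(max_bytes, end - offset)
        chunks.append((offset, n))
        offset += n
    return chunks


max_bytes = resolve_max_transfer(max_hw_sectors_kb=128, block_align=4096)
chunks = split_transfer(offset=0, length=300 * 1024, max_bytes=max_bytes)
print(max_bytes, chunks)
```

With a 128 KiB limit, a 300 KiB chunk becomes three commands of 128 KiB, 128 KiB, and 44 KiB. Because the limit is a multiple of block_align, every piece of a block-aligned extent stays block-aligned.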

5. End-to-end I/O flow (io_uring_cmd path)

Python: put_many(keys, objs)
  └─ _write_buffers(offsets, bufs, ...)
       └─ use_uring_cmd=True
            └─ _write_uring_cmd_buffers()
                 └─ nvme_uring_cmd_prep(cmd, offset, len, dspec=placement_id)
                      └─ IoUringWrapper::Big → UringCmd80 → SQ push
                           └─ submitter().submit()  [1 syscall]
                                └─ NVMe HW: processes SLBA, NLB, FDP dspec
                                     └─ wait_iouring(batch_id) → reap CQ

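6. Relationship to M3

Once #3274 lands, the builder pattern is in place but no setup flags are set:

// state after #3274 (M3 starting point)
IoUring::<Entry128, Entry32>::builder()
    .build(iouring_queue_depth as u32)  // ← M3 adds flags here

IoUring::<SqueueEntry, Entry>::builder()
    .build(iouring_queue_depth as u32)  // ← apply to this fallback path as well

The worker thread is structurally a single issuer, yet the kernel is not told so.
M3 adds setup_single_issuer() and setup_defer_taskrun() to reduce SQ submission overhead. Note that IORING_SETUP_SINGLE_ISSUER needs kernel 6.0+ and IORING_SETUP_DEFER_TASKRUN needs 6.1+ (and requires SINGLE_ISSUER), so older kernels will likely need a fallback without these flags.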

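7. Potential FDP integration

The dspec parameter of nvme_uring_cmd_prep is the field that carries the FDP placement handle.

cdw13 = (dspec as u32) << 16   // the FDP placement_id goes here

Once Han Dae-gyu's (한대규) FDP design (the logic that decides the placement_id) is complete, passing the placement_id into this dspec is enough to enable FDP. #3274 is the PR that completes that plumbing.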

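8. Current limitations and review feedback

Item	Status	Details
fixed buffer + uring_cmd	Not implemented (stated in the PR description)	To be added later
Alignment validation	✅ Resolved (6/1)	DongDongJu's comment addressed: the `use_uring_cmd` path's own transfer-alignment condition was clarified
Block-aligned transfers only	Unchanged	Unaligned I/O is not supported
`--use-uring-cmd` UX	✅ Presumed resolved	Ankit's 6/1 update may have improved the error message
Docs not updated	⏳ Pending (ApostaC's 6/3 comments)	the `benchmarks/storage_backend_io/` README and `docs/source/mp/l2_storage.rst` need updating; the PR merges once this is resolved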

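Summary of DongDongJu's 6/1 APPROVED review

The use_uring_cmd path has its own transfer-alignment validation, so it is safe
The device/transfer-limit validation structure is now clearer
LGTM: the Python core path is cleared

Merge blockers (as of 6/4)

Only the two doc comments remain to be addressed. The code is APPROVED. Once Ankit pushes a follow-up commit, the merge can proceed.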

9. Link to starting M1/M2

M1/M2 can start right after #3274 merges:

M1: add an io_engine="io_uring" branch to _write_one/load_many_into, routing to the batched_write/batched_read paths
M2: add setup_single_issuer() and setup_defer_taskrun() to IoUring::builder(); must be applied to both the Big and Standard paths (see §3.1)
This wiki page can be used as-is as the basis for the M1/M2 PR descriptions and design
