
raw_block io_uring_cmd implementation (PR #3274)

[!tldr] Work-perspective takeaway: an NVMe io_uring_cmd passthrough path has been added to the LMCache raw_block backend. It bypasses the block layer entirely and sends commands straight to the NVMe driver. The key piece is the dspec field in NvmeUringCmd.cdw13: putting an FDP placement_id there selects the data placement stream at the NVMe level. In other words, this PR completes the plumbing for the FDP backend. Because it also introduces the builder() pattern, it is the starting point for M3 (io_uring setup flag tuning).


PR Overview

| Item | Details |
|---|---|
| PR | #3274 |
| Author | Ankit Kumar (@ankit-sam) |
| Status | Open (under review) |
| Summary | Restores the io_uring code dropped during the MP-mode integration rebase and adds a new NVMe io_uring_cmd passthrough path |

I/O Path Comparison

Existing io_uring:
App → io_uring (SQE 64B) → Block Layer → NVMe Driver → SSD

io_uring_cmd (this PR):
App → io_uring_cmd (big SQE 128B) → NVMe Driver → SSD
                                  ↑
                    Block layer skipped entirely

Rust Changes (lib.rs)

IoUringWrapper: kernel-version compatibility abstraction

#[derive(Clone)]
enum IoUringWrapper {
    Standard(Arc<Mutex<IoUring<SqueueEntry, Entry>>>), // kernel 5.4 to 5.18
    Big(Arc<Mutex<IoUring<Entry128, Entry32>>>),       // kernel 5.19+
}

| Type | SQE size | CQE size | Required kernel | Use |
|---|---|---|---|---|
| Standard | 64 bytes | 16 bytes | 5.4+ | Regular pread/pwrite |
| Big | 128 bytes | 32 bytes | 5.19+ | io_uring_cmd (NVMe passthrough) |

Carrying the 80-byte NVMe command struct inline in the SQE requires a 128-byte SQE.

Initialization tries Big first and falls back to Standard on failure:

let ring = match IoUring::<Entry128, Entry32>::builder()
    .build(iouring_queue_depth as u32)
{
    Ok(big_ring) => IoUringWrapper::Big(Arc::new(Mutex::new(big_ring))),
    Err(_) => {
        if use_uring_cmd {
            return Err(PyRuntimeError::new_err(
                "io_uring_cmd requires kernel 5.19 or later",
            ));
        }
        // Fallback: Standard (construction of std_ring is omitted from this excerpt)
        IoUringWrapper::Standard(Arc::new(Mutex::new(std_ring)))
    }
};

M3 starting point: the core of M3 is adding calls such as .setup_single_issuer().setup_defer_taskrun() before .build(). They must be applied to both the Big and Standard paths.


NvmeUringCmd: NVMe command struct

#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct NvmeUringCmd {
    opcode: u8,        // NVME_IO_READ(0x02) / NVME_IO_WRITE(0x01)
    flags: u8,
    rsvd1: u16,
    nsid: u32,         // NVMe Namespace ID
    cdw2: u32,
    cdw3: u32,
    metadata: u64,
    addr: u64,         // data buffer address
    metadata_len: u32,
    data_len: u32,     // transfer size (bytes)
    cdw10: u32,        // SLBA[31:0]: lower 32 bits of the starting LBA
    cdw11: u32,        // SLBA[63:32]: upper 32 bits of the starting LBA
    cdw12: u32,        // NLB (Number of Logical Blocks - 1) | dtype
    cdw13: u32,        // dspec: FDP placement handle ID (the key field)
    cdw14: u32,
    cdw15: u32,
    rsvd2: [u32; 4],
}

The dspec field in cdw13 is the FDP hook point. Putting an FDP placement_id there selects the data placement stream at the NVMe level.
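To double-check the 80-byte figure and see where cdw13 sits, the struct can be mirrored with ctypes. This is only a sketch: the field order is copied from the Rust excerpt above, and the Python class and its constants are illustrative, not part of the PR.

```python
import ctypes


class NvmeUringCmd(ctypes.Structure):
    """ctypes mirror of the #[repr(C)] struct above (same field order)."""

    _fields_ = [
        ("opcode", ctypes.c_uint8),
        ("flags", ctypes.c_uint8),
        ("rsvd1", ctypes.c_uint16),
        ("nsid", ctypes.c_uint32),
        ("cdw2", ctypes.c_uint32),
        ("cdw3", ctypes.c_uint32),
        ("metadata", ctypes.c_uint64),
        ("addr", ctypes.c_uint64),
        ("metadata_len", ctypes.c_uint32),
        ("data_len", ctypes.c_uint32),
        ("cdw10", ctypes.c_uint32),
        ("cdw11", ctypes.c_uint32),
        ("cdw12", ctypes.c_uint32),
        ("cdw13", ctypes.c_uint32),
        ("cdw14", ctypes.c_uint32),
        ("cdw15", ctypes.c_uint32),
        ("rsvd2", ctypes.c_uint32 * 4),
    ]


STANDARD_SQE_BYTES = 64   # from the table in this note
BIG_SQE_BYTES = 128       # from the table in this note

cmd_size = ctypes.sizeof(NvmeUringCmd)       # total command size in bytes
cdw13_offset = NvmeUringCmd.cdw13.offset     # byte offset of the dspec dword

# The command alone is larger than a whole standard SQE, so it can only be
# carried inline when the ring is created with 128-byte SQEs.
fits_in_standard_sqe = cmd_size <= STANDARD_SQE_BYTES
```

With this layout the command is 80 bytes, which already exceeds a 64-byte SQE before any SQE header fields are counted.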


nvme_uring_cmd_prep: building the NVMe command

fn nvme_uring_cmd_prep(
    cmd: &mut NvmeUringCmd,
    is_write: bool,
    nsid: u32,
    offset: u64,     // byte offset
    len: usize,
    lba_shift: u32,  // LBA size = 1 << lba_shift (9=512B, 12=4KB)
    ptr: *const u8,
    dtype: u8,       // Directive Type (FDP = 2)
    dspec: u16,      // FDP placement handle ID
) {
    let slba = offset >> lba_shift;    // byte offset → LBA number
    let nlb = (len >> lba_shift) - 1;  // transfer size → block count - 1

    cmd.cdw10 = (slba & 0xFFFFFFFF) as u32;
    cmd.cdw11 = (slba >> 32) as u32;
    cmd.cdw12 = nlb as u32 | ((dtype as u32) << 20);
    cmd.cdw13 = (dspec as u32) << 16;  // FDP placement_id goes here
    cmd.addr = ptr as u64;
    cmd.data_len = len as u32;
}

LBA conversion:

Byte offset → SLBA = offset >> lba_shift
Transfer size → NLB = (len >> lba_shift) - 1
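A worked example makes the bit packing concrete. The sketch below redoes the same arithmetic as nvme_uring_cmd_prep in Python. The function name build_rw_dwords and the alignment and range checks are additions for illustration (the excerpt above does not show any validation).

```python
FDP_DTYPE = 2  # Directive Type for FDP, as noted in the excerpt above


def build_rw_dwords(offset, length, lba_shift, dtype=0, dspec=0):
    """Return cdw10..cdw13 for a block-aligned read/write (illustrative helper)."""
    block = 1 << lba_shift
    if length <= 0 or offset % block or length % block:
        raise ValueError("offset and length must be non-zero multiples of the LBA size")
    slba = offset >> lba_shift          # byte offset -> starting LBA
    nlb = (length >> lba_shift) - 1     # block count, zero-based
    if nlb > 0xFFFF:
        raise ValueError("NLB does not fit in the 16-bit field of cdw12")
    return {
        "cdw10": slba & 0xFFFFFFFF,             # SLBA[31:0]
        "cdw11": (slba >> 32) & 0xFFFFFFFF,     # SLBA[63:32]
        "cdw12": nlb | ((dtype & 0xF) << 20),   # NLB | dtype
        "cdw13": (dspec & 0xFFFF) << 16,        # dspec in the upper 16 bits
    }


# 128 KiB write at byte offset 1 MiB on a 4 KiB-LBA namespace, placement id 3.
dwords = build_rw_dwords(1 << 20, 128 * 1024, lba_shift=12,
                         dtype=FDP_DTYPE, dspec=3)
```

Here SLBA is 256 and NLB is 31, so cdw12 becomes 0x20001F and cdw13 becomes 0x30000.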

register_fixed_buffers: registering buffers for zero-copy I/O

fn register_fixed_buffers(&self, buffer_ptrs: Vec<usize>, buffer_sizes: Vec<usize>) -> PyResult<()> {
    let mut map = self.fixed_buffer_map.lock().unwrap();
    for (idx, (ptr, size)) in buffer_ptrs.iter().zip(buffer_sizes.iter()).enumerate() {
        map.insert(*ptr, (idx as u16, *size));
    }
    let iovecs: Vec<libc::iovec> = ...;
    ring.submitter().register_buffers(&iovecs) // registers N buffers with a single syscall
}

Registering buffers with the kernel ahead of time lets each I/O refer to them by buf_index, so no address translation is needed per request. CPU/GPU memory is pinned once and then reused.
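The map built above goes from a base pointer to (buf_index, size). At I/O time something has to turn an arbitrary pointer into a buf_index. One way to do that is sketched below as a Python model. The class FixedBufferTable and the range lookup are assumptions for illustration. The excerpt only shows the map being filled, not how it is queried.

```python
import bisect


class FixedBufferTable:
    """Maps a (ptr, length) range to the index of the registered buffer containing it."""

    def __init__(self, buffer_ptrs, buffer_sizes):
        if len(buffer_ptrs) != len(buffer_sizes):
            raise ValueError("buffer_ptrs and buffer_sizes must have equal length")
        # Keep the registration index (buf_index) alongside each buffer,
        # then sort by base address so lookups can use binary search.
        entries = sorted(
            (ptr, size, idx)
            for idx, (ptr, size) in enumerate(zip(buffer_ptrs, buffer_sizes))
        )
        self._entries = entries
        self._bases = [base for base, _, _ in entries]

    def lookup(self, ptr, length):
        """Return buf_index if [ptr, ptr+length) lies inside one registered buffer, else None."""
        pos = bisect.bisect_right(self._bases, ptr) - 1
        if pos < 0:
            return None
        base, size, idx = self._entries[pos]
        if ptr + length <= base + size:
            return idx
        return None


table = FixedBufferTable([0x2000, 0x1000], [0x1000, 0x800])
```

A None result would mean the request has to take the non-fixed path, since the kernel has no pre-pinned mapping for that range.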


Worker thread batched submit pattern

let batch: Vec<IoSubmission> = std::mem::take(&mut *q);
for sub in batch.iter().take(to_submit_count) {
    build_and_submit_sqe(&ring_clone, sub, user_data); // pushed onto the SQ (no syscall)
}
ring.submitter().submit() // one syscall hands all N entries to the kernel

Requests are queued on the SQ and then submitted as a batch with a single syscall, so N I/Os cost one syscall.
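The syscall saving is easy to see in a toy model. The sketch below is not io_uring. FakeRing is a made-up stand-in that only counts how many times submit() is called, and drain_and_submit is a hypothetical name for the worker-loop step shown above.

```python
from collections import deque


class FakeRing:
    """Stand-in for a ring: push() is a memory write, submit() models the syscall."""

    def __init__(self, sq_depth):
        self.sq_depth = sq_depth
        self.sq = []
        self.submitted = []
        self.submit_calls = 0

    def push(self, entry):
        if len(self.sq) >= self.sq_depth:
            raise BufferError("submission queue full")
        self.sq.append(entry)

    def submit(self):
        self.submit_calls += 1
        count = len(self.sq)
        self.submitted.extend(self.sq)
        self.sq.clear()
        return count


def drain_and_submit(pending, ring):
    """Move as many pending requests as fit into the SQ, then submit once."""
    take = min(len(pending), ring.sq_depth - len(ring.sq))
    for _ in range(take):
        ring.push(pending.popleft())
    return ring.submit() if take else 0


pending = deque(range(10))
ring = FakeRing(sq_depth=4)
batch_sizes = []
while pending:
    batch_sizes.append(drain_and_submit(pending, ring))
```

Ten requests against a queue depth of 4 go out in three submit calls rather than ten.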


Python Changes (core.py)

_write_buffers / _read_buffers: I/O path routing

io_engine = "posix" → pwrite_from_buffer (existing synchronous I/O)
io_engine = "io_uring"
└── use_uring_cmd=True → _write_uring_cmd_buffers (NVMe passthrough)
└── use_uring_cmd=False
    └── payload==total (aligned) → batched_write + wait_iouring (batched async)
    └── otherwise → write_uring (individual async)

Unlike the existing _write_one, which issued two pwrite calls (one for the header, one for the payload), the io_uring path submits multiple I/Os at once through batched_write.
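The routing tree can be read as a small decision function. The sketch below restates it in Python. The function name choose_write_path and the string return values are illustrative, and the real _write_buffers performs the I/O rather than returning a label.

```python
def choose_write_path(io_engine, use_uring_cmd, payload_len, total_len):
    """Return the name of the write path the routing tree above would select."""
    if io_engine == "posix":
        return "pwrite_from_buffer"            # existing synchronous I/O
    if io_engine != "io_uring":
        raise ValueError(f"unknown io_engine: {io_engine!r}")
    if use_uring_cmd:
        return "_write_uring_cmd_buffers"      # NVMe passthrough
    if payload_len == total_len:
        return "batched_write+wait_iouring"    # aligned: batched async
    return "write_uring"                       # otherwise: individual async


examples = [
    choose_write_path("posix", False, 4096, 4096),
    choose_write_path("io_uring", True, 4096, 4096),
    choose_write_path("io_uring", False, 4096, 4096),
    choose_write_path("io_uring", False, 4000, 4096),
]
```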

register_fixed_buffers_from_allocator

def register_fixed_buffers_from_allocator(self, memory_allocator) -> None:
    buffers = memory_allocator.get_paged_buffers()
    buffer_ptrs = [buf.data_ptr() for buf in buffers]
    buffer_sizes = [buf.numel() * buf.element_size() for buf in buffers]
    self._rawdev().register_fixed_buffers(buffer_ptrs, buffer_sizes)

Registers the CPU allocator's paged buffers with io_uring, so that subsequent I/O on those buffers is zero-copy.

max_hw_sectors_kb: automatic transfer-size splitting

max_hw_sectors_kb = _read_sysfs_int(f"{queue_dir}/max_hw_sectors_kb")
resolved_bytes = max_hw_sectors_kb * 1024
aligned_bytes = (resolved_bytes // self.block_align) * self.block_align

The maximum size the NVMe device can handle in a single command is read from sysfs. If a KV chunk exceeds it, the chunk is split and issued as multiple NVMe commands.
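The splitting step could look roughly like the sketch below. resolve_max_transfer repeats the arithmetic from the snippet above. split_transfer is a hypothetical helper, and how the PR actually iterates over chunks is not shown in this note.

```python
def resolve_max_transfer(max_hw_sectors_kb, block_align):
    """Largest per-command size in bytes, rounded down to the block alignment."""
    resolved_bytes = max_hw_sectors_kb * 1024
    aligned_bytes = (resolved_bytes // block_align) * block_align
    if aligned_bytes == 0:
        raise ValueError("device limit is smaller than one aligned block")
    return aligned_bytes


def split_transfer(offset, length, max_bytes):
    """Split [offset, offset+length) into (offset, length) pieces of at most max_bytes."""
    chunks = []
    done = 0
    while done < length:
        piece = min(max_bytes, length - done)
        chunks.append((offset + done, piece))
        done += piece
    return chunks


limit = resolve_max_transfer(max_hw_sectors_kb=128, block_align=4096)
chunks = split_transfer(0, 300 * 1024, limit)
```

With a 128 KiB device limit, a 300 KiB chunk becomes three commands of 128 KiB, 128 KiB, and 44 KiB. Because the limit is a multiple of the block alignment, every piece of a block-aligned chunk stays block-aligned.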


End-to-end I/O flow (io_uring_cmd path)

Python: put_many(keys, objs)
└─ _write_buffers(offsets, bufs, ...)
   └─ use_uring_cmd=True
      └─ _write_uring_cmd_buffers()
         └─ nvme_uring_cmd_prep(cmd, offset, len, dspec=placement_id)
            └─ IoUringWrapper::Big → UringCmd80 → SQ push
               └─ submitter().submit() [1 syscall]
                  └─ NVMe HW: processes SLBA, NLB, FDP dspec
         └─ wait_iouring(batch_id) → reap the CQ

Relationship to M3

Once this PR lands, the builder() pattern is in place, but no setup flags are set:

// State after #3274 (M3 starting point)
IoUring::<Entry128, Entry32>::builder()
    .build(iouring_queue_depth as u32) // ← adding flags here is M3

IoUring::<SqueueEntry, Entry>::builder()
    .build(iouring_queue_depth as u32) // ← apply the same to the fallback path

The worker thread is structurally a single issuer, but the kernel is not told so.
M3 adds setup_single_issuer() and setup_defer_taskrun() → lower SQ submission overhead.
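Which flags M3 can turn on depends on the running kernel. The sketch below picks flag names from a kernel release string. The minimum versions for the two M3 flags (6.0 for single issuer, 6.1 for defer taskrun) are my understanding of upstream Linux, not something stated in the PR, and the function names are hypothetical. Distro kernels may backport features, so attempting the build and falling back, as the PR already does for Big, is likely the more robust check.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '6.8.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def pick_setup_flags(release, want_big):
    """Return the setup flag names to request for this kernel (illustrative only)."""
    version = parse_kernel_release(release)
    flags = []
    if want_big:
        if version < (5, 19):
            raise RuntimeError("io_uring_cmd requires kernel 5.19 or later")
        flags += ["SQE128", "CQE32"]
    if version >= (6, 0):          # assumed minimum for single issuer
        flags.append("SINGLE_ISSUER")
    if version >= (6, 1):          # assumed minimum for defer taskrun
        flags.append("DEFER_TASKRUN")
    return flags


flags_new = pick_setup_flags("6.8.0-45-generic", want_big=True)
flags_old = pick_setup_flags("5.19.0", want_big=True)
```

In practice the release string would come from platform.release(). As far as I understand, defer taskrun is only accepted together with single issuer, which the ordering above preserves.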


FDP integration potential

The dspec parameter of nvme_uring_cmd_prep is the field that carries the FDP placement handle:

cdw13 = (dspec as u32) << 16 ← FDP placement_id goes here

Once the logic that decides the FDP placement_id is in place, passing it into this dspec is all it takes to enable FDP.
#3274 is the PR that completes that plumbing.

Current PR:
nvme_uring_cmd_prep(... dspec=0) ← placement not specified

Next FDP step:
nvme_uring_cmd_prep(... dspec=placement_id)
                             ↑
              RUH number specified directly → WAF ↓

Current limitations and review feedback

| Item | Details |
|---|---|
| fixed buffer + uring_cmd combination | Not implemented yet (stated in the PR description) |
| Alignment validation logic | Request to improve validation of misaligned byte ranges (DongDongJu comment) |
| Unaligned I/O | Not supported (block-aligned transfers only) |
| --use-uring-cmd UX | Misleading error message when used on its own without --use-uring |

Related pages

  • [[io_uring]]: io_uring_cmd concepts, details of the big SQE/CQE layout
  • [[raw_block-종단-분석]]: full-stack analysis of raw_block, FDP insertion points H1-H8
  • [[NVMe-FDP]]: Placement Handle and RUH, the values that go into dspec
  • [[기여-포인트-맵]]: contribution points [2][8]
  • [[PR-3274-IoUring-NVMe]]: tracking page for this PR (current status, M1/M2 start conditions)