Raw Block line-by-line end-to-end analysis (TODO 4)

Files analyzed:

docs/design/v1/distributed/l2_adapters/raw_block.md
lmcache/v1/distributed/l2_adapters/raw_block_l2_adapter.py (770 LOC)
lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py (585 LOC, for comparison)
lmcache/v1/storage_backend/raw_block/core.py (1476 LOC)
lmcache/v1/storage_backend/raw_block/key_codec.py (168 LOC)
rust/raw_block/src/lib.rs (2030 LOC)

L1: Contract / end-to-end interface

Layer structure

flowchart TD
    subgraph MP["MP mode (currently recommended path)"]
        SC["StoreController<br/>PrefetchController"]
        Adapter["RawBlockL2Adapter<br/>raw_block_l2_adapter.py:281<br/>implements L2AdapterInterface"]
    end

    subgraph Legacy["Legacy non-MP path (legacy facade)"]
        SM["StorageManager"]
        Plugin["RustRawBlockBackend<br/>rust_raw_block_backend.py:74<br/>implements StoragePluginInterface"]
    end

    Core["RawBlockCore<br/>core.py:152<br/>slot allocation / index / checkpoint / lock refcount"]

    Rust["RawBlockDevice (Rust PyO3)<br/>rust/raw_block/src/lib.rs:355<br/>POSIX or io_uring engine"]

    Dev[("raw block device / file<br/>/dev/nvme*n* or pre-sized file")]

    SC --> Adapter
    SM --> Plugin
    Adapter --> Core
    Plugin --> Core
    Core --> Rust
    Rust --> Dev

Separation of responsibilities

Layer	Responsibility	Does not do
`RawBlockL2Adapter`	Async submit/pop/query contract, 3 eventfds (store/lookup/load), 3 ThreadPoolExecutors (sizes 2/1/4), task_id issuance, listener notification	Slot allocation, indexing, device I/O
`RustRawBlockBackend` (legacy)	Non-MP `StoragePluginInterface`, prefix-only contains/get, calls Core via asyncio.run_coroutine_threadsafe, pin/unpin	(No eventfd-based async contract like the MP path; the asyncio loop is supplied from outside)
`RawBlockCore`	Device open/close, in-memory key index (`_index`), free slot list, `_inflight` tracking, lock refcount, metadata checkpoint + recovery, slot header validation	Async contract, worker thread pools (the Adapter owns these)
`RawBlockDevice` (Rust)	POSIX `pread/pwrite` or `io_uring` engine, `register_fixed_buffers`, AlignedBuf (O_DIRECT bounce), single device fd + single ring + single worker thread	Slots / index / checkpoints / keys

MP adapter entry / exit invariants (`raw_block.md` + code verification)

Contract	Location	Notes
Three separate eventfds (store / lookup / load)	adapter:331-333	Created with `create_event_notifier()`
submit is non-blocking	adapter:407-415, etc.	Returns task_id immediately after `ThreadPoolExecutor.submit`
Result retrieval: pop_completed_*, query_*_result	adapter:417-422, 452-455, 499-502	An entry is removed from the dict once it has been popped
L2 lock = `exists_many(.., lock=True)`	core.py:523-547	Each hit increments `_lock_refcnt`; `unlock_many` decrements it
`delete(force=False)` preserves locked slots	core.py:670-692	Keeps the MP lookup-and-lock → load → unlock window safe
Caller provides the load destination buffer	adapter:462-497	The adapter does not allocate destinations
close() order: pool shutdown → core.close → eventfd close	adapter:535-561	The device is closed only after all in-flight tasks have finished

Unsupported features / constraints

per_tp_device_paths is rejected in MP mode (adapter:147-150). Only the non-MP RustRawBlockBackend maps devices per TP rank (rust_raw_block_backend.py:108-130).
With use_odirect=True, the L1 alignment must be at least block_align (adapter:306-313).
slot_bytes, header_bytes, and meta_total_bytes must all be multiples of block_align.
slot_bytes >= header_bytes + 1.
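
The constraints above can be checked mechanically. The following is a minimal sketch, not the adapter's actual validation code: `LayoutConfig`, `validate_layout`, the `l1_align_bytes` field name, and the `block_align=4096` default are all assumptions made for illustration; only the rules themselves come from this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutConfig:
    """Hypothetical container; defaults mirror the values quoted in this note."""
    slot_bytes: int = 1 << 20          # 1 MiB
    header_bytes: int = 4 << 10        # 4 KiB
    meta_total_bytes: int = 256 << 20  # 256 MiB
    block_align: int = 4096            # assumed value, not stated in the note
    use_odirect: bool = True
    l1_align_bytes: int = 4096         # hypothetical name for the L1 alignment


def validate_layout(cfg: LayoutConfig) -> list[str]:
    """Return the violated constraints; an empty list means the layout is valid."""
    errors = []
    for name in ("slot_bytes", "header_bytes", "meta_total_bytes"):
        if getattr(cfg, name) % cfg.block_align != 0:
            errors.append(f"{name} must be a multiple of block_align")
    if cfg.slot_bytes < cfg.header_bytes + 1:
        errors.append("slot_bytes must be >= header_bytes + 1")
    if cfg.use_odirect and cfg.l1_align_bytes < cfg.block_align:
        errors.append("L1 alignment must be >= block_align when use_odirect=True")
    return errors
```

Returning a list rather than raising on the first failure makes it easy to report every misconfigured field at once.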

L2: I/O paths (put / get / lookup / evict)

Device layout

0                                                        device_size
├─────── meta_total_bytes (default 256 MiB) ──────┤
│  meta container 0   │  meta container 1   │   data slots... │
│  (mirror copy)      │  (mirror copy)      │                 │
│  header(4KB) + json │  header(4KB) + json │  slot 0  slot 1 │
└─────────────────────┴─────────────────────┴─────────────────┘
                       ↑                    ↑
                  meta_copy_count=2    data_base_offset
                                       = meta_total_bytes

Slot = fixed size slot_bytes (default 1 MiB). The leading header_bytes (default 4 KiB) of each slot record the LMCBLK01 magic + 64-bit slot_identity + payload_len (core.py:940-950).
Metadata: 2 mirror copies at the front of the same device, JSON-serialized with a zlib CRC32; _meta_seq increments and the copies are written round-robin (core.py:1161-1198).
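
To make the layout concrete, here is a rough sketch of the slot offset arithmetic and a slot header codec. The byte order, field widths, and field order in `_HEADER` are assumptions (the note only lists the magic, a 64-bit slot_identity, and payload_len), and `slot_offset`, `pack_header`, and `unpack_header` are hypothetical helper names, not functions from core.py.

```python
import struct

MAGIC = b"LMCBLK01"
HEADER_BYTES = 4 * 1024                 # default header_bytes
SLOT_BYTES = 1024 * 1024                # default slot_bytes
META_TOTAL_BYTES = 256 * 1024 * 1024    # data_base_offset == meta_total_bytes

# Assumed packing: 8-byte magic, u64 slot_identity, u64 payload_len, little-endian.
_HEADER = struct.Struct("<8sQQ")


def slot_offset(slot_index: int) -> int:
    """Device offset of a slot, assuming slots are laid out back to back."""
    return META_TOTAL_BYTES + slot_index * SLOT_BYTES


def pack_header(slot_identity: int, payload_len: int) -> bytes:
    """Build a zero-padded header block of HEADER_BYTES."""
    if payload_len > SLOT_BYTES - HEADER_BYTES:
        raise ValueError("payload does not fit in one slot")
    return _HEADER.pack(MAGIC, slot_identity, payload_len).ljust(HEADER_BYTES, b"\0")


def unpack_header(buf: bytes) -> tuple[int, int]:
    """Return (slot_identity, payload_len) or raise if the magic is wrong."""
    magic, slot_identity, payload_len = _HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad slot magic")
    return slot_identity, payload_len
```

The payload of a slot would then start at `slot_offset(i) + HEADER_BYTES`, which matches the `entry.offset + header_bytes` read offset used in the load flow.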

put flow (`submit_store_task`)

sequenceDiagram
    participant SC as StoreController
    participant A as RawBlockL2Adapter
    participant P as ThreadPool (rawblk-store)
    participant C as RawBlockCore
    participant R as RawBlockDevice (Rust)
    participant D as Device

    SC->>A: submit_store_task(keys, objs)
    A->>A: issue task_id + inflight++
    A->>P: pool.submit(_run_store_task)
    A-->>SC: task_id (returns immediately)

    P->>C: put_many(specs, objs)
    loop each (key, obj)
        C->>C: acquire _lock
        C->>C: skip if already indexed/inflight
        C->>C: _allocate_slot_locked() → offset
        C->>C: _inflight[encoded] = (offset, meta)
        C->>C: release _lock
        C->>R: pwrite_from_buffer(offset, header)
        R->>D: pwrite (POSIX) or SQE Write/WriteFixed (io_uring)
        C->>R: pwrite_from_buffer(offset+header_bytes, payload)
        R->>D: pwrite ...
        C->>C: acquire _lock
        C->>C: _inflight pop → register in _index
        C->>C: _meta_dirty_total++
    end
    C-->>P: RawBlockPutManyResult
    P->>A: _finish_store_task (callback)
    A->>A: _completed_store_tasks[task_id] = success
    A->>A: _notify_keys_stored (listener)
    A->>A: store_efd.notify()
    SC-->>SC: poll(store_efd) wakes up
    SC->>A: pop_completed_store_tasks()

Key points:

Header and payload are written with 2 separate pwrite calls: the header must go first so that the slot identity is persisted on disk.
With O_DIRECT, the header is also rounded up to block_align (core.py:919-924).
If the payload buffer is block_align-aligned and enable_zero_copy=True, _build_direct_odirect_view builds a raw memoryview via ctypes for zero-copy I/O (core.py:801-854).
_inflight_io_count++ and the last_io_ts update feed the checkpoint's idle-quiet check.

get / load flow (`submit_load_task`)

The caller pre-allocates the destination buffers and passes them in. The Adapter never allocates new memory.

core.load_many_into(encoded_keys, objs):
  with lock: items = [(k, _index.get(k)) for k in encoded_keys]; inflight_io_count++
  for i, (k, entry) in enumerate(items):
      if entry is None: continue  # miss
      payload_len = entry.size
      total_len = round_up(payload_len, block_align) if O_DIRECT else payload_len
      direct_view = _build_direct_odirect_view(...)
      raw_dev.pread_into(entry.offset + header_bytes, buf, payload_len, total_len)
      objs[i].metadata.cached_positions = entry.meta.cached_positions
  inflight_io_count--

→ The result is converted to a bitmap, then load_efd.notify() is called.
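
The read-length arithmetic in the load path can be sketched as follows. `round_up` appears in the pseudocode above; `read_span` is a hypothetical helper introduced here, and the 4096-byte defaults for `header_bytes` and `block_align` are assumptions for the example.

```python
def round_up(n: int, align: int) -> int:
    """Round n up to the next multiple of align."""
    if align <= 0:
        raise ValueError("align must be positive")
    return (n + align - 1) // align * align


def read_span(entry_offset: int, payload_len: int, *, header_bytes: int = 4096,
              block_align: int = 4096, odirect: bool = True) -> tuple[int, int]:
    """Return (device offset, bytes to read) for one indexed entry.

    With O_DIRECT the read length is padded to block_align; only the first
    payload_len bytes of the result are meaningful payload.
    """
    start = entry_offset + header_bytes
    total_len = round_up(payload_len, block_align) if odirect else payload_len
    return start, total_len
```

For a 5000-byte payload with 4 KiB alignment, the O_DIRECT read covers two blocks, so the destination buffer needs room for the padded length rather than just `payload_len`.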

lookup-and-lock flow

core.exists_many(keys, lock=True):
  with lock:
      for k in keys:
          found = k in _index
          if found and lock: _lock_refcnt[k] += 1

Pure in-memory lookup with no device I/O. lock=True is the key: it protects the entry from eviction until the load finishes.
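
A small model of the refcount behaviour described above, as a sketch only. `LockTable` and `is_locked` are hypothetical names; `exists_many`, `unlock_many`, `_index`, and `_lock_refcnt` follow the note, but the real RawBlockCore signatures and return types may differ.

```python
import threading
from collections import defaultdict


class LockTable:
    """Toy stand-in for the lookup-and-lock part of RawBlockCore."""

    def __init__(self, keys):
        self._lock = threading.Lock()
        self._index = set(keys)
        self._lock_refcnt = defaultdict(int)

    def exists_many(self, keys, lock=False):
        """In-memory lookup; each hit takes one lock reference when lock=True."""
        with self._lock:
            hits = []
            for k in keys:
                found = k in self._index
                if found and lock:
                    self._lock_refcnt[k] += 1
                hits.append(found)
            return hits

    def unlock_many(self, keys):
        """Drop one lock reference per key; never goes below zero."""
        with self._lock:
            for k in keys:
                if self._lock_refcnt.get(k, 0) > 0:
                    self._lock_refcnt[k] -= 1
                    if self._lock_refcnt[k] == 0:
                        del self._lock_refcnt[k]

    def is_locked(self, key):
        with self._lock:
            return self._lock_refcnt.get(key, 0) > 0
```

Because the count is per key rather than a boolean, two overlapping prefetches of the same key each need their own `unlock_many` before the entry becomes evictable again.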

evict / delete flow (`adapter.delete`)

core.delete_many(keys, force=False):
  with lock:
      for k in keys:
          if locked and not force: keep the entry, return False
          removed = _index.pop(k)
          _inflight[k].canceled = True  (if present; prevents a race)
          _free_slots.append(slot)
          _meta_dirty_total++

Nothing is written to the slot itself immediately; it is only returned to the free list and overwritten by the next put.
_inflight.canceled = True is set → the put worker returns the slot to the free list when it finishes (core.py:501-507).
force=False is the default, so slots under lookup-and-lock are safe.
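
The interaction between an in-flight put and a concurrent delete can be modelled as below. This is a simplified sketch under assumptions: `SlotTable`, `InflightPut`, `begin_put`, and `finish_put` are hypothetical names, device I/O is omitted, and the return values are chosen for the example rather than taken from core.py.

```python
import threading
from dataclasses import dataclass


@dataclass
class InflightPut:
    slot: int
    canceled: bool = False


class SlotTable:
    """Toy model of _index / _inflight / _free_slots under one lock."""

    def __init__(self, num_slots: int) -> None:
        self._lock = threading.Lock()
        self._free_slots = list(range(num_slots))
        self._index: dict[str, int] = {}
        self._inflight: dict[str, InflightPut] = {}
        self._lock_refcnt: dict[str, int] = {}

    def begin_put(self, key: str):
        """Reserve a slot; returns None if the key is known or no slot is free."""
        with self._lock:
            if key in self._index or key in self._inflight or not self._free_slots:
                return None
            rec = InflightPut(self._free_slots.pop())
            self._inflight[key] = rec
            return rec

    def finish_put(self, key: str, rec: InflightPut) -> bool:
        """Called after the device write; publishes the entry unless canceled."""
        with self._lock:
            self._inflight.pop(key, None)
            if rec.canceled:
                self._free_slots.append(rec.slot)  # the writer returns the slot
                return False
            self._index[key] = rec.slot
            return True

    def delete(self, key: str, force: bool = False) -> bool:
        with self._lock:
            if self._lock_refcnt.get(key, 0) > 0 and not force:
                return False  # locked entries survive a non-forced delete
            rec = self._inflight.get(key)
            if rec is not None:
                rec.canceled = True
            slot = self._index.pop(key, None)
            if slot is not None:
                self._free_slots.append(slot)
                self._lock_refcnt.pop(key, None)
            return slot is not None or rec is not None
```

Marking the in-flight record instead of freeing the slot directly means the slot is never handed to another writer while the original write is still running.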

Checkpoint flow

_checkpoint_loop (background daemon thread):
  every meta_checkpoint_interval_sec:
    _checkpoint_once(force=False):
      if not dirty: skip
      if inflight_io_count > 0 or now - last_io_ts < meta_idle_quiet_ms: skip
      _snapshot_state() → takes _lock and JSON-serializes the entire dict (a problem for large indexes!)
      _write_checkpoint(): pwrite payload → pwrite header → meta_seq++

→ A checkpoint runs only after meta_idle_quiet_ms=100ms without I/O. Under sustained writes no checkpoint happens, so the recovery window after a crash grows.
→ At boot, _load_checkpoint_from_device picks the mirror copy with the larger seq of the two, verifies its CRC32, and rebuilds _index via apply_loaded_state.
→ With meta_verify_on_load=True, the slot_identity in each slot header is also read from the device and checked for a match (core.py:1407-1450).
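
The idle-quiet gate and the mirror selection at boot can be sketched in a few lines. The function names and the in-memory dict representation of a mirror copy are hypothetical, and falling back to the older copy when the newer one fails its CRC is an assumption; the note only says the larger seq is chosen and the CRC32 is verified.

```python
import json
import zlib
from typing import Optional


def should_checkpoint(dirty: bool, inflight_io_count: int, now_s: float,
                      last_io_ts_s: float, idle_quiet_ms: float = 100.0) -> bool:
    """Mirror of the skip conditions: dirty, no in-flight I/O, and a quiet period."""
    if not dirty or inflight_io_count > 0:
        return False
    return (now_s - last_io_ts_s) * 1000.0 >= idle_quiet_ms


def encode_copy(seq: int, state: dict) -> dict:
    """Serialize one mirror copy: JSON payload plus its CRC32 and sequence number."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return {"seq": seq, "crc32": zlib.crc32(payload), "payload": payload}


def pick_checkpoint(copies: list) -> Optional[dict]:
    """Return the state from the valid copy with the highest seq, or None."""
    best = None
    for copy in copies:
        if zlib.crc32(copy["payload"]) != copy["crc32"]:
            continue  # torn or corrupted copy
        if best is None or copy["seq"] > best["seq"]:
            best = copy
    return None if best is None else json.loads(best["payload"])
```

Writing the copies round-robin means a crash in the middle of a checkpoint write damages at most the copy being written, so the other copy still decodes.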

L3: FDP / HC-SSD candidate insertion points

Items the design document explicitly leaves as TODO (raw_block.md:42-44):

FDP / placement-hint support
A raw NVMe command path

Candidate hook points

#	Location	Data available to the hook	What to add	Scope of impact	Priority
H1	`RawBlockCore._write_one` (core.py:898-938) → `pwrite_from_buffer` call	slot offset, key spec, header content, payload size	Add a placement_id argument and pass it down to Rust	Host → device entry point. Every write passes through this single point	★★★
H2	`RawBlockDevice.pwrite_from_buffer` / `batched_write` (Rust lib.rs:1078, 1710)	fd, offset, buf, len	NVMe `IORING_OP_URING_CMD` (passthru) or `RWF_*` write hint flags	Rust ↔ kernel boundary. Depends on the io_uring FDP patches (kernel 6.8+)	★★★
H3	`RawBlockCore.put_many` (core.py:434-521), right after slot allocation	encoded_key, slot offset, MemoryObj metadata	PLID classification policy based on `cache_salt` / model_name / cached_positions	Where the policy lives. Hot/cold decisions	★★
H4	`_allocate_slot_locked` (core.py:1004-1013)	next_slot, _free_slots	Separate free lists per PLID; align slot groups with RU boundaries	Stability of the slot → RU mapping (reclaimed slots go back to the same RU)	★★
H5	metadata checkpoint write (`_write_checkpoint`, core.py:1161-1198)	Very long-lived data (tens of minutes to permanent)	Pin to a dedicated PLID (the longest-lifetime PLID)	Stabilizes WAF. Metadata-region GC does not mix with data-slot GC	★★★
H6	`delete_many` (core.py:653-692) → free slot reclamation	encoded_key, offset, slot index	`dataset management (DSM) deallocate` or an FDP RU reset hint via `IORING_OP_URING_CMD`	Tells the device about reclaimed space. Further WAF reduction	★
H7	`RustRawBlockBackend._build_core_config` / `RawBlockL2AdapterConfig`	User JSON config	`fdp_plid_*` fields, `placement_strategy` enum	config surface	★
H8	`register_fixed_buffers` (Rust lib.rs:1017)	Start address of the host memory pool	Per-registered-buffer PLID metadata; the mapped PLID is applied automatically on `WriteFixed`	Zero-copy hot path and automatic PLID assignment at the same time	★ (P2)

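Areas to examine separately from the HC-SSD (high-capacity) perspective

#	Problem	Location	Notes
C1	The key index `dict[str, _Entry]` is held entirely in memory	core.py:241	RAM pressure at tens of TB / millions of slots. Index offload needed
C2	`_snapshot_state` holds `_lock` while iterating the whole dict + JSON-serializing	core.py:1102-1145	Lock hold time explodes with a large index. Latency spikes for concurrent I/O
C3	`_free_slots: list[int]` is linear; the `slot in self._free_slots` membership check is O(n) (core.py:1019)	core.py:246, 1015-1021	At high capacity a long free list degrades the delete hot path
C4	meta_total_bytes is fixed at a 256 MiB default	adapter:77	May be insufficient at high capacity. Should be sized automatically in proportion to capacity
C5	`meta_idle_quiet_ms=100`: no checkpoint under sustained writes	core.py:1200-1211	Longer crash window. Consider a separate index fsync path
C6	Single ring + single worker (Rust layer)	lib.rs:439, 479	Limits TB-scale throughput. Needs multiple rings / NUMA-aware scheduling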
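
For C3, one possible direction is to pair the free list with a set so the duplicate check becomes O(1). This is a sketch of that idea only, not a patch against core.py; `FreeSlots`, `release`, and `acquire` are hypothetical names.

```python
from typing import Optional


class FreeSlots:
    """LIFO free list with O(1) membership checks via a companion set."""

    def __init__(self) -> None:
        self._stack: list[int] = []
        self._members: set[int] = set()

    def release(self, slot: int) -> bool:
        """Return a slot to the pool; False if it was already free."""
        if slot in self._members:      # O(1) instead of scanning the list
            return False
        self._stack.append(slot)
        self._members.add(slot)
        return True

    def acquire(self) -> Optional[int]:
        """Take the most recently freed slot, or None if the pool is empty."""
        if not self._stack:
            return None
        slot = self._stack.pop()
        self._members.remove(slot)
        return slot

    def __len__(self) -> int:
        return len(self._stack)
```

The cost is one extra set entry per free slot, which is small next to the index itself; whether it matters in practice depends on the free-list sizes raised under Open Questions.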

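Candidate data classes for policy design

Data	Lifetime	Tentative PLID group	Rationale
Metadata checkpoint	Very long (permanent)	Dedicated PLID 0	2 mirror copies, written every 60 s, almost never deleted
KV slots sharing the same `cache_salt`	Same request cluster	Same PLID	Likely to be written together and to expire together
Frequently hit hot slots	Long	hot PLID	Based on the frequency of the listener's `on_l2_keys_accessed` (adapter:728)
Slots written once and evicted soon after	Short	cold PLID	Based on LRU score

Classification by cache_salt is already used for the adapter._bytes_by_cache_salt accounting (adapter:597-605), so it can be reused naturally as a policy input.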
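
One way the table could turn into a policy function is sketched below. Everything here is hypothetical: the PLID numbering beyond "metadata gets PLID 0", the precedence of the rules, the `hot_threshold`, and the hashing of `cache_salt` into a fixed number of groups are illustrative choices, not decisions recorded in this note.

```python
import zlib
from typing import Optional

METADATA_PLID = 0     # dedicated PLID for checkpoints, as in the table
HOT_PLID = 1          # hypothetical numbering from here on
COLD_PLID = 2
FIRST_SALT_PLID = 3


def classify_plid(*, is_metadata: bool, cache_salt: Optional[str],
                  access_count: int, hot_threshold: int = 8,
                  num_salt_groups: int = 4) -> int:
    """Map one write to a placement ID using the candidate classes above."""
    if is_metadata:
        return METADATA_PLID
    if access_count >= hot_threshold:
        return HOT_PLID
    if cache_salt:
        # crc32 is stable across processes, unlike the built-in hash() for str.
        bucket = zlib.crc32(cache_salt.encode("utf-8")) % num_salt_groups
        return FIRST_SALT_PLID + bucket
    return COLD_PLID
```

A real device exposes a limited number of placement handles, so `num_salt_groups` would have to be derived from the device's FDP configuration rather than fixed.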

L4: io_uring usage analysis

Source: Explore agent analysis (12-point check of rust/raw_block/src/lib.rs).

Engine / ring structure

Item	Value	Location
Engine selection	`io_engine` string ("posix" / "io_uring") + legacy `use_iouring` bool	lib.rs:50-61
Ring instances	1 device = 1 ring, `Arc<Mutex>`	lib.rs:439, 441
Worker threads	1 device = 1 dedicated worker thread	lib.rs:479
Queue depth	`iouring_queue_depth` (default 256), passed as-is to `IoUring::new(qd)`	lib.rs:48, 439, 976
Setup flags	None (default)	lib.rs:439

→ None of setup_iopoll, setup_sqpoll, setup_single_issuer, setup_coop_taskrun, or setup_defer_taskrun is enabled. The ring runs with kernel defaults.

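Ops / submission pattern

Item	Value
Ops used	`Read`, `Write`, `ReadFixed`, `WriteFixed` (4 kinds)
Vectored I/O	None (Readv/Writev unused)
Fsync	None
Batching	The worker submits as many SQEs as are available from the queue, then calls `submit()` (`submit_and_wait` unused)
Completion wait	Condvar + polling with a 10 µs timeout (not a busy-wait; yields cooperatively)
Matching	Each submission gets a `user_data`; the CQE's `user_data()` is looked up in the inflight HashMap
Short I/O handling	If bytes_transferred < len, the offset/len are adjusted and the request is resubmitted (lib.rs:505-585)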
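
The short-I/O rule in the last row is the usual "advance and retry" loop. The sketch below shows the same idea with synchronous `os.pwrite` / `os.pread` on a temporary file; it is an analogy in Python, not the Rust worker's code, which resubmits an SQE instead of looping. `pwrite_all` and `demo` are hypothetical names.

```python
import os
import tempfile


def pwrite_all(fd: int, data: bytes, offset: int) -> int:
    """Write all of data at offset, retrying after short writes."""
    view = memoryview(data)
    done = 0
    while done < len(view):
        # On a short write, advance both the buffer position and the file offset.
        n = os.pwrite(fd, view[done:], offset + done)
        if n == 0:
            raise OSError("pwrite made no progress")
        done += n
    return done


def demo() -> bytes:
    """Round-trip a small payload through a temporary file."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        pwrite_all(fd, b"hello world", 4096)
        return os.pread(fd, 11, 4096)
```

The zero-progress check matters because a request that keeps returning 0 bytes would otherwise be retried forever.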

Registered buffers / fixed file

Item	Value
`IORING_REGISTER_BUFFERS`	Called (`register_fixed_buffers` Python method, lib.rs:1017-1069)
`IORING_REGISTER_FILES`	Not called
`WriteFixed` / `ReadFixed` usage	Used automatically when the buffer pointer matches a registered buffer (lib.rs:544-550)
Registration lifecycle	`unregister_buffers()` on close (lib.rs:1990)
Utilization	opt-in: unless the Python side explicitly calls `register_fixed_buffers`, only plain Read/Write are used

→ Whether register_fixed_buffers is actually called anywhere in current LMCache still needs separate confirmation. RawBlockCore._rawdev() does not call it (core.py:280-299), so the fixed-buffer infrastructure exists but appears to be unused → P1 improvement candidate.

O_DIRECT / alignment

AlignedBuf (lib.rs:170-222): allocates aligned buffers with posix_memalign
Synchronous paths (pwrite_from_buffer, pread_into) fall back to a bounce buffer when the buffer is unaligned
Batched paths (batched_write/batched_read) do not use a bounce buffer: misalignment raises ValueError immediately (lib.rs:1631) → the host must guarantee alignment to use the batched path
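
A host-side pre-check for the batched path might look like the sketch below. Which of address, offset, and length the Rust code actually validates is not stated here, so checking all three is an assumption based on general O_DIRECT requirements; `buffer_address` and `check_batched_io` are hypothetical helpers.

```python
import ctypes
import mmap


def buffer_address(buf) -> int:
    """Address of a writable buffer (mmap, bytearray, ...) via ctypes."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


def check_batched_io(buf, offset: int, length: int, block_align: int = 4096) -> bool:
    """Raise ValueError if any of address / offset / length is misaligned."""
    checks = (("address", buffer_address(buf)), ("offset", offset), ("length", length))
    for name, value in checks:
        if value % block_align != 0:
            raise ValueError(f"{name}={value} is not a multiple of {block_align}")
    return True
```

An anonymous `mmap.mmap(-1, size)` region is page-aligned, which makes it a convenient aligned buffer for experiments: `check_batched_io(mmap.mmap(-1, 8192), offset=0, length=4096)` passes, while an offset of 512 raises.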

NVMe passthru / ioctl

Currently only the BLKGETSIZE64 ioctl is used (device size query, lib.rs:153)
No NVMe passthru path such as IORING_OP_URING_CMD exists ← this is the key place where we would add FDP

Summary of io_uring usage on the LMCache side

Question	Answer
Where is the ring created	`RawBlockDevice.__init__` (Rust), one per device
How is queue depth decided	adapter config `iouring_queue_depth` (default 256) → used as-is for the ring `entries`
Batching policy	The worker drains the in-flight queue, then submits as many SQEs as are available in one batch
Op types	Read / Write / ReadFixed / WriteFixed
Registered buffers	Code exists; unused on current LMCache code paths
Per-TP device sharding	Supported only by the non-MP `RustRawBlockBackend` (TP rank → device path); MP rejects it

Next-step candidates (to be decided when code changes begin)

(Verify) grep for actual register_fixed_buffers calls and, if there are none, find out why (lifetime issue? alignment mismatch with the L1 memory pool?). → Need to check lmcache/v1/memory_management.py (combine with TODO §11-3 and §11-4)
(Design) Write a plugin skeleton: reuse RawBlockL2Adapter's pattern of 3 ThreadPoolExecutors + 3 eventfds as-is as the base for the FDP plugin
(Design) Write a minimal patch for the signature changes that carry the PLID through two places: H1 (placement_id in core _write_one) + H2 (hint in the Rust pwrite/batched_write signatures)
(Measure) Current baseline: measure single-device throughput / latency / WAF with iouring_queue_depth=256, num_store_workers=2, num_load_workers=4, and document it as the Phase 0 reference

Open Questions

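Is register_fixed_buffers called anywhere in LMCache? (The infrastructure exists in the code, but no caller has been identified.)
With FDP, must the L1 memory pool's data_ptr come from pools separated per PLID, or is it enough to pass a different PLID argument per slot?
How long does _snapshot_state hold the lock with a 1-million-entry index? Needs measurement in the HC-SSD range.
At what point (number of free slots) does the O(n) membership check on _free_slots: list[int] become a measurable problem?
Is the 256 MiB default for meta_total_bytes always sufficient for the actual index payload (JSON-serialized)? Estimate it as key length × number of entries.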
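
For the last question, a back-of-the-envelope estimate can be scripted. The per-entry JSON shape used here (`offset` and `size` only) is an assumption, since the real checkpoint schema is not shown in this note, and splitting meta_total_bytes evenly between the 2 mirror copies with a 4 KiB header each is inferred from the layout diagram. `estimate_index_json_bytes` and `fits_in_meta_region` are hypothetical helpers.

```python
import json
from typing import Optional


def estimate_index_json_bytes(num_entries: int, key_len: int,
                              sample_entry: Optional[dict] = None) -> int:
    """Estimate the serialized index size as (bytes per entry) x (entries)."""
    entry = sample_entry if sample_entry is not None else {"offset": 1 << 40, "size": 1 << 20}
    one = json.dumps({"k" * key_len: entry}, separators=(",", ":"))
    per_entry = len(one) - 2 + 1   # drop the outer braces, add a separating comma
    return num_entries * per_entry


def fits_in_meta_region(num_entries: int, key_len: int,
                        meta_total_bytes: int = 256 << 20,
                        copies: int = 2, header_bytes: int = 4096) -> bool:
    """Check the estimate against the payload budget of one mirror copy."""
    budget = meta_total_bytes // copies - header_bytes
    return estimate_index_json_bytes(num_entries, key_len) <= budget
```

Passing a `sample_entry` that mirrors the real checkpoint fields (for example cached_positions) would give a tighter bound; with the minimal entry above the result is a lower bound on the real size.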

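Notes

Of the 7 Rust / 7 Python / 2 L2 adapter improvement items in the earlier memo raw-block-perf-findings.md, the one additionally confirmed by the L4 analysis is that the fixed-buffer infrastructure may be unused (P1). The locations of all other items were reconfirmed by the line analysis.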
