FDP implementation scope audit: `DongDongJu/LMCache:fdp-waf-agentic-replay-poc`

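For each of the seven categories (a to g), this audit gives a table of how far the PoC implementation goes, the remaining gaps, and the candidates for upstream adoption.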

1. Implementation matrix by category

#	Category	Y/N/Partial	Location (file:line)	Notes
(a)	placement_id policy module	Y (per-vLLM-Worker, partially prompt-aware)	`raw_block/core.py:342` `_select_fdp_ruh()` ; `raw_block/core.py:380` `_select_fdp_metadata_ruh()` ; `benchmarks/agentic_mp_trace/replay/fdp_policy.py:18` `resolve_policy()` (per storage class)	The per-vLLM-Worker policy lives in the core layer. Prompt-aware/phase-aware policies are routed through the harness layer (storage class); at the hardware level the PoC only issues per-worker RUH mappings
(b)	raw_block backend path (Python ↔ Rust ↔ io_uring_cmd)	Y	Python: `raw_block/core.py:399` `_write_directive_for_ruh()` → `(dtype, dspec)` ; Rust FFI: `rust/raw_block/src/lib.rs:804` `pwrite_from_buffer(... dtype=0, dspec=0)` ; cdw13 injection: `rust/raw_block/src/lib.rs:268` `cmd.cdw13 = (dspec as u32) << 16`	The data and metadata paths both go through the same path. The default is `fdp_directive_type=2` (FDP directive)
(c)	file backend support (`fcntl(F_SET_FILE_RW_HINT)`)	N	`lmcache/v1/distributed/l2_adapters/fs_l2_adapter.py` (unchanged by the PoC)	Not implemented at all. No trace of RWH_WRITE_LIFE_*. This is the area that stage 5 `file_backend_fdp.md` will build from scratch
(d)	NVMe passthrough opcode	Partial	identify NS: `rust/raw_block/src/lib.rs:348` `nvme_identify_ns()` ; uring_cmd prep: `rust/raw_block/src/lib.rs` `nvme_uring_cmd_prep()` (cdw12 = dtype, cdw13 = dspec << 16) ; NVMe write opcode: `rust/raw_block/src/lib.rs` `cmd.opcode = NVME_IO_WRITE`
(e)	Tests (unit / integration / real device)	Y (3-tier)	unit: `tests/v1/storage_backend/test_rust_raw_block_backend.py:2161` (includes FDP directive validation scenarios) ; integration: `tests/v1/distributed/test_raw_block_l2_adapter.py:153` (`use_fdp`/`fdp_*_ruh_ids` config parsing + core mapping) ; real device: `tests/v1/storage_backend/test_rust_raw_block_backend.py:2196` `test_rust_raw_block_backend_fdp_opens_explicit_device` (enabled via the `LMCACHE_RAW_BLOCK_FDP_TEST_DEVICE`/`LMCACHE_RAW_BLOCK_FDP_RUHS` env vars)	The real NVMe FDP device is passed in through environment variables. Usable as-is on smrc
(f)	Measurement harness (code that records WAF/p99)	Y (1,500+ lines)	`benchmarks/fdp_waf_stress/run_fdp_waf_stress.py` 1456 lines ; `benchmarks/fdp_waf_stress/generate_synthetic_traces.py` 574 lines ; `benchmarks/agentic_mp_trace/replay_region_churn.py` 897 lines	Compares three modes: `mixed`/`separated`/`no_fdp`. Covers trace footprint analysis, byte-window allocation, and concurrent multi-worker execution. A candidate for direct reuse in the stage 4 measurements. Metric catalog: `04_harness_metrics.md` (output files, summary.json schema, availability mapping for the five PLAN metrics, vendor counter candidates)
(g)	Configuration/CLI exposure	Partial	L2 adapter JSON: `raw_block_l2_adapter.py:101` (`use_fdp`/`fdp_ruh_ids`/`fdp_data_ruh_ids`/`fdp_metadata_ruh_ids`/`fdp_directive_type`/`fdp_metadata_mode`) ; legacy plugin extras: `plugins/rust_raw_block_backend.py:329` (`rust_raw_block.use_fdp` etc.) ; CLI flag: none (commands such as `lmcache trace replay` do not expose `--placement-policy`)	Configurable only through YAML/JSON config. Direct CLI exposure is not implemented
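Rows (b) and (d) describe how `(dtype, dspec)` ends up in the NVMe command dwords. The following is a minimal sketch of that packing, not the PoC's code. The `cdw13 = dspec << 16` shift is taken from the table; placing `dtype` at bits 23:20 of cdw12 is an assumption based on the NVMe write command layout, since the table only abbreviates it as "cdw12 = dtype". The function name `encode_write_directive` is hypothetical.

```python
from __future__ import annotations

# Default directive type from the matrix (fdp_directive_type=2).
FDP_DIRECTIVE_TYPE = 2


def encode_write_directive(dtype: int, dspec: int) -> tuple[int, int]:
    """Return the (cdw12, cdw13) bit contributions for a write directive.

    dtype=0, dspec=0 yields (0, 0), i.e. a plain write with no directive.
    """
    if not 0 <= dtype <= 0xF:
        raise ValueError(f"dtype must fit in 4 bits, got {dtype}")
    if not 0 <= dspec <= 0xFFFF:
        raise ValueError(f"dspec must fit in 16 bits, got {dspec}")
    # ASSUMPTION: DTYPE occupies bits 23:20 of cdw12 (NVMe write command).
    cdw12 = dtype << 20
    # From the matrix: cmd.cdw13 = (dspec as u32) << 16
    cdw13 = dspec << 16
    return cdw12, cdw13


if __name__ == "__main__":
    cdw12, cdw13 = encode_write_directive(FDP_DIRECTIVE_TYPE, 3)
    print(hex(cdw12), hex(cdw13))
```

In a real command the cdw12 value would be OR-ed with the other cdw12 fields (such as the block count), which this sketch leaves out.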

2. Gaps, one line each

For each category, the gap that stage 5 (deriving improvements) needs to fill:

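(a) gap: the prompt-aware / phase-aware policies exist only in the harness layer, not in the backend itself. When taking this upstream, a decision is needed: lift it into an abstract `PlacementPolicy` interface, or keep the current approach in which the harness decides the RUH in advance and stamps it into the metadata.
(b) gap: the path itself is complete. However, the signature of the raw_block backend currently on dev (`/lmcache/v1/storage_backend/raw_block/core.py`) differs from the PoC signature (the PoC adds the `_write_directive_for_ruh` interface), so upstreaming requires choosing between a backward-compat layer on the existing interface and a migration.
(c) gap: the entire file backend. Stage 5 `file_backend_fdp.md` will compare three options: (A) `fcntl(F_SET_FILE_RW_HINT, RWH_WRITE_LIFE_*)`, (B) per-prompt file separation + RW_HINT, (C) using io_uring_cmd in the same way as raw_block.
(d) gap: there is no automatic RUH descriptor lookup such as `fetch_fdp_status`. This makes the PoC less operations-friendly, so the `fetch_fdp_status` pattern from ankit#1 should be adopted.
(e) gap: nothing missing. That said, the PoC's `test_rust_raw_block_backend_fdp_opens_explicit_device` is only a single-key smoke test; integration tests for the load path, `put_many` batch scenarios, and failure scenarios (e.g. an invalid `fdp_directive_type` value) are only partial.
(f) gap: the harness launches child LMCache instances via `subprocess.run(['lmcache', 'trace', 'replay', ...])`. It is not directly integrated with measurements of real vLLM workloads, so connecting it to the tensormesh harness in stage 4 needs separate work.
(g) gap: no CLI flags. Consider exposing something like `lmcache storage configure --use-fdp --data-ruhs 0,1,2,3 --metadata-ruhs 4` in a stage 5 follow-up PR.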
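Option (A) in gap (c) can be illustrated with a small sketch of setting a write-lifetime hint on a file descriptor. This is not code from the PoC or from `fs_l2_adapter.py`. The numeric constants are the values from the Linux `fcntl.h` UAPI header, the helper name `set_write_hint` is hypothetical, and the fallback to the per-inode `F_SET_RW_HINT` is an assumption added because some kernels reject the per-file command.

```python
import fcntl
import struct
import tempfile

# Values from the Linux UAPI header <linux/fcntl.h>.
F_LINUX_SPECIFIC_BASE = 1024
F_SET_RW_HINT = F_LINUX_SPECIFIC_BASE + 12       # per-inode hint
F_SET_FILE_RW_HINT = F_LINUX_SPECIFIC_BASE + 14  # per-open-file hint

RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_MEDIUM = 3
RWH_WRITE_LIFE_LONG = 4


def set_write_hint(fd: int, hint: int) -> bool:
    """Try to attach a write-lifetime hint to fd; return True on success.

    The kernel expects a pointer to a 64-bit integer, so the hint is
    packed into an 8-byte buffer before being passed to fcntl().
    """
    arg = struct.pack("=Q", hint)
    for cmd in (F_SET_FILE_RW_HINT, F_SET_RW_HINT):
        try:
            fcntl.fcntl(fd, cmd, arg)
            return True
        except OSError:
            # Command unsupported on this kernel or filesystem: try the next.
            continue
    return False


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        ok = set_write_hint(f.fileno(), RWH_WRITE_LIFE_SHORT)
        print("hint applied" if ok else "hint not supported here")
```

Whether such a hint reaches the device as an FDP placement handle depends on the kernel and filesystem, which is part of what the option comparison would need to establish.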

3. Reusable components (upstream candidates)

Units that can be taken from the PoC almost as-is:

Unit	File	Candidate PR	Dependencies
Five config fields (`use_fdp`/`fdp_data_ruh_ids`/`fdp_metadata_ruh_ids`/`fdp_directive_type`/`fdp_metadata_mode`) + validation	`raw_block/core.py:82-225`, `raw_block_l2_adapter.py:101-260`	PR-A (stage 5 `next_prs.md`)	Need to check whether the dev raw_block core uses a dataclass config. On dev, the `RawBlockCore` config may currently have a different shape
Three-function abstraction around `_write_directive_for_ruh()` (select / map / compute dtype and dspec)	`raw_block/core.py:342-408`	PR-A	core.py signature extension: a `placement_hint` argument on put/put_many
Rust FFI `dtype`/`dspec` arguments + `validate_nvme_directive`	`rust/raw_block/src/lib.rs:298-313`	PR-A (Rust part)	Builds on the io_uring_cmd infrastructure merged into dev (#3274)
`fetch_fdp_status` (automatic RUH descriptor lookup)	Not in the PoC; taken from ankit#1	PR-A or PR-B	NVMe identify cmd opcode
Metadata RUH separation (`fdp_metadata_mode="per_ruh"`)	`raw_block/core.py:248-251` (`_fdp_metadata_ruh_index`)	PR-A or a separate PR-B	The meaning of meta_total_bytes changes (reserved per RUH); backward compatibility needs review
FDP WAF stress harness (mode comparison: mixed/separated/no_fdp)	`benchmarks/fdp_waf_stress/`	PR-D (separate benchmark PR)	Policy for organizing the benchmark directory (whether upstream will accept it)
storage_class-level RUH policy abstraction (prompt-aware fallback)	`benchmarks/agentic_mp_trace/replay/fdp_policy.py`	(on hold)	Depends on the harness; adopting it upstream directly requires defining what a storage class means
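To make the first two rows concrete, here is a hedged sketch of how the five config fields and the select / compute-directive split could fit together. It is an illustration only: the class `FdpConfig`, the validation rules, the round-robin worker-to-RUH mapping, and the use of the RUH id directly as `dspec` are all assumptions, not the PoC's actual logic. Only the field names, the default `fdp_directive_type=2`, the `"per_ruh"` mode value, and the `dtype=0, dspec=0` no-directive default come from this audit.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FdpConfig:
    """Hypothetical container for the five PoC config fields."""

    use_fdp: bool = False
    fdp_data_ruh_ids: tuple[int, ...] = ()
    fdp_metadata_ruh_ids: tuple[int, ...] = ()
    fdp_directive_type: int = 2
    fdp_metadata_mode: str = "per_ruh"  # only value named in the audit

    def __post_init__(self) -> None:
        if not self.use_fdp:
            return
        # ASSUMED rule: enabling FDP requires at least one data RUH.
        if not self.fdp_data_ruh_ids:
            raise ValueError("use_fdp=True requires fdp_data_ruh_ids")
        for ruh in self.fdp_data_ruh_ids + self.fdp_metadata_ruh_ids:
            if not 0 <= ruh <= 0xFFFF:
                raise ValueError(f"RUH id out of 16-bit range: {ruh}")


def select_data_ruh(cfg: FdpConfig, worker_index: int) -> int | None:
    """Pick a data RUH for a worker (ASSUMED round-robin mapping)."""
    if not cfg.use_fdp:
        return None
    return cfg.fdp_data_ruh_ids[worker_index % len(cfg.fdp_data_ruh_ids)]


def write_directive_for_ruh(cfg: FdpConfig, ruh: int | None) -> tuple[int, int]:
    """Return (dtype, dspec); (0, 0) means a write without a directive."""
    if ruh is None:
        return (0, 0)
    # ASSUMPTION: the RUH id is passed through as dspec unchanged.
    return (cfg.fdp_directive_type, ruh)


if __name__ == "__main__":
    cfg = FdpConfig(use_fdp=True, fdp_data_ruh_ids=(0, 1, 2, 3),
                    fdp_metadata_ruh_ids=(4,))
    ruh = select_data_ruh(cfg, worker_index=5)
    print(ruh, write_directive_for_ruh(cfg, ruh))
```

Keeping selection and directive computation as separate functions mirrors the split named in the table and would let a `placement_hint` argument replace `worker_index` later without touching the directive step.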

4. PoC assets usable in stage 4 (measurement)

The PoC ships its own measurement harness, so the stage 4 R0 to R2 runs can be executed with it as-is.

# Using the PoC harness (on smrc, at stage 4)
python -m benchmarks.fdp_waf_stress.run_fdp_waf_stress \
  --config benchmarks/fdp_waf_stress/config.example.yaml \
  --mode no_fdp     # R1 baseline
# ↔
python -m benchmarks.fdp_waf_stress.run_fdp_waf_stress \
  --config benchmarks/fdp_waf_stress/config.example.yaml \
  --mode separated  # R2 FDP on, with data/metadata RUH separation
# ↔
python -m benchmarks.fdp_waf_stress.run_fdp_waf_stress \
  --config benchmarks/fdp_waf_stress/config.example.yaml \
  --mode mixed      # R3 FDP on, without metadata separation (for comparison)

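Takeaway: independently of the tensormesh harness, the PoC's own harness can handle the R1/R2/R3 comparison in one go. Before writing new measurement scripts for stage 4, trying this harness first is the efficient choice (a candidate addition to the stage 4 PLAN).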
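The three invocations above differ only in `--mode`, so they could be driven from one small script. The sketch below builds the command lines and optionally runs them in sequence. The module path, `--config`, and `--mode` values are the ones shown in this section; the helper names `plan_runs` and `run_all` and the R1/R2/R3 label mapping as a dict are illustrative assumptions, and nothing here parses the harness output.

```python
from __future__ import annotations

import subprocess
import sys

HARNESS_MODULE = "benchmarks.fdp_waf_stress.run_fdp_waf_stress"

# Run labels and modes as listed in this section.
MODES = {"R1": "no_fdp", "R2": "separated", "R3": "mixed"}


def plan_runs(config_path: str) -> dict[str, list[str]]:
    """Build one harness command line per run label without executing it."""
    return {
        label: [sys.executable, "-m", HARNESS_MODULE,
                "--config", config_path, "--mode", mode]
        for label, mode in MODES.items()
    }


def run_all(config_path: str) -> None:
    """Execute the planned runs one after another, stopping on failure."""
    for label, cmd in plan_runs(config_path).items():
        print(f"[{label}] {' '.join(cmd)}")
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_all() from the repository root to execute.
    for label, cmd in plan_runs(
        "benchmarks/fdp_waf_stress/config.example.yaml"
    ).items():
        print(label, " ".join(cmd))
```

Running the modes sequentially rather than in parallel keeps them from competing for the same device, which matters when the quantity being compared is write amplification.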

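5. Overall conclusion

The PoC is a self-contained measurement environment built for the NVIDIA PoC. The FDP core (a/b/d), the measurement harness (f), part of the tests (e), and the configuration (g) are well covered. The file backend (c), however, is untouched, and automatic NVMe RUH descriptor lookup (part of d) is a strength that only ankit#1 has.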

Recommended split of the upstream work (a preliminary sketch for stage 5 `next_prs.md`):

PR-A (large): five config fields + `_select_fdp_ruh`/`_write_directive_for_ruh` + Rust dtype/dspec + `validate_nvme_directive`. A single PR, but with a large surface area
PR-B: integrate the automatic `fetch_fdp_status` lookup from ankit#1
PR-C: metadata RUH separation (PoC f3a4297a), as a separate PR once the WAF benefit has been demonstrated
PR-D: file backend FDP (write hint or shared io_uring_cmd), a new area
PR-E (optional): FDP WAF stress harness `benchmarks/fdp_waf_stress/`

The priority of each PR and the actual split will be finalized in stage 5 `next_prs.md` after reviewing the stage 4 measurement results.
