MP Mode vs. Non-MP Mode

Change-verification guide (after the next fetch):
git log eaa2bfee..HEAD -- lmcache/v1/distributed/l2_adapters/raw_block_l2_adapter.py lmcache/v1/storage_backend/abstract_backend.py lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py docs/source/mp/ docs/source/getting_started/quickstart.rst
If the MP rejection of per_tp_device_paths at raw_block_l2_adapter.py:147-150 is lifted, the "MP rejects it" conclusion in §4, §5, and §6 no longer holds, and the FDP Phase 0 single-device assumption must be re-examined as well.

If the location or signature of StoragePluginInterface changes, update the interface comparison table in §4.

If the official documentation escalates Non-MP to deprecated, update the rationale for "recommended = MP" in §2.

Sources:

docs/source/mp/index.rst (introduction to MP from the user's perspective)
docs/source/mp/architecture.rst (MP structure from the developer's perspective)
docs/source/developer_guide/extending_lmcache/storage_plugins.rst (Non-MP plugin)
docs/design/v1/distributed/l2_adapters/overall.md (MP L2 adapter)
docs/design/v1/distributed/l2_adapters/raw_block.md

1. One-line definitions

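| Mode | Official name | One-line summary |
| --- | --- | --- |
| MP (Multi-Process) | Multiprocess (MP) mode | LMCache runs as a separate process (server), and vLLM instances connect over ZMQ to send cache requests |
| Non-MP | In-process mode | LMCache runs as a library inside the vLLM process (vLLM does `import lmcache` and calls it directly) |

"MP" = "cache server mode that runs as a separate process"; "Non-MP" = "mode embedded inside vLLM as a library (= legacy)".

Source of the official names: docs/source/getting_started/quickstart.rst:21-29 defines the two modes as "Multiprocess (MP) mode" and "In-process mode" respectively.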

2. Why two modes exist

LMCache started out as a library inside vLLM (Non-MP). MP mode was introduced after the following problems surfaced (docs/source/mp/index.rst):

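vLLM inference threads and LMCache's GIL-bound Python/CPU work (hashing, memory management, L2 I/O) contend for the same GIL inside one process, which affects inference latency.
When several vLLM pods run on one node, they cannot share an L1 cache (each pod holds its own inside its own process).
A bug on the LMCache side takes the vLLM process down with it.
CPU memory (cache) and GPU memory (inference) resources are forced to scale together.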

→ LMCache was therefore split out into a separate process that vLLM instances attach to over ZMQ (= MP mode).

Currently recommended: MP mode. Non-MP remains for legacy compatibility and single-process scenarios.

Evidence (official documentation):

docs/source/getting_started/quickstart.rst:23: "Multiprocess (MP) mode -- recommended. ... Scales better, exposes management/observability endpoints, and supports sharing one cache across multiple engine instances."
docs/source/getting_started/quickstart.rst:27-29: "In-process mode ... Single command, convenient for quick single-node experiments." (= positioned for quick experiments, not deprecated)
docs/source/mp/index.rst:43-44: in the server entry point table, lmcache server is marked "Recommended."
docs/source/mp/index.rst:9-24: MP's Key Benefits (process isolation / GIL separation / cache sharing across pods / independent scaling) are the basis for the recommendation.

3. Process / communication structure comparison

Non-MP (legacy, embedded in vLLM)

┌────────────────────────── vLLM process ───────────────────────────┐
│                                                                   │
│  vLLM inference code                                              │
│      │                                                            │
│      │ Python function call (in-process)                          │
│      ▼                                                            │
│  LMCacheEngine                                                    │
│  ├─ L1 (CPU memory)                                               │
│  └─ StorageBackendInterface                                       │
│        └─ StoragePluginInterface (Mooncake, S3, Rust raw_block …) │
└───────────────────────────────────────────────────────────────────┘
                       │
                       ▼ (optional) Local SSD / Remote KV store

Communication: function calls (same Python process)
Entry interface: StorageBackendInterface / StoragePluginInterface
Async model: asyncio event loop (provided by vLLM, so the plugin uses run_coroutine_threadsafe)
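
The async model above can be illustrated with a minimal sketch, assuming a host-owned event loop running in another thread. Only `asyncio.run_coroutine_threadsafe` is a real API here; `async_contains`, `start_host_loop`, and `blocking_contains` are hypothetical names for illustration and are not part of LMCache or vLLM.

```python
import asyncio
import threading


async def async_contains(store: dict, key: str) -> bool:
    """Hypothetical async lookup standing in for a plugin coroutine."""
    await asyncio.sleep(0)  # yield to the loop, as real I/O would
    return key in store


def start_host_loop() -> asyncio.AbstractEventLoop:
    """Stand-in for the event loop that the host process provides."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def blocking_contains(loop, store, key, timeout=5.0) -> bool:
    """Schedule the coroutine on the host loop from another thread and wait."""
    future = asyncio.run_coroutine_threadsafe(async_contains(store, key), loop)
    return future.result(timeout=timeout)


loop = start_host_loop()
store = {"chunk-0": b"kv"}
hit = blocking_contains(loop, store, "chunk-0")
miss = blocking_contains(loop, store, "chunk-1")
loop.call_soon_threadsafe(loop.stop)  # ask the host loop to stop
```

The point of the sketch is the dependency: without a loop that someone else is already running, the blocking call has nothing to schedule onto.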

MP (currently recommended, separate server)

┌── vLLM Pod 1 ──┐  ┌── vLLM Pod 2 ──┐  ┌── vLLM Pod N ──┐
│  vLLM client   │  │  vLLM client   │  │  vLLM client   │
└────────┬───────┘  └────────┬───────┘  └────────┬───────┘
         │                   │                   │
         └─── ZMQ (DEALER/ROUTER, tcp) ──────────┘
                             │
                             ▼
        ┌────────────────────────────────────────────┐
        │     lmcache server (separate process)      │
        │                                            │
        │  MessageQueueServer  (mq.py)               │
        │       │                                    │
        │       ▼                                    │
        │  MPCacheEngine       (server.py)           │
        │       │                                    │
        │       ▼                                    │
        │  StorageManager  (distributed/)            │
        │  ├─ L1Manager  (CPU memory + TTL lock)     │
        │  ├─ StoreController     ─┐                 │
        │  ├─ PrefetchController  ─┤                 │
        │  └─ EvictionController  ─┘                 │
        │             │                              │
        │             ▼                              │
        │  L2AdapterInterface (raw_block, dax,       │
        │   nixl, s3, plugin, native_plugin …)       │
        └────────────────────────────────────────────┘
                             │
                             ▼ Local SSD / Remote KV store

Communication: ZMQ DEALER/ROUTER (tcp). Commands are dispatched by the RequestType enum.
Entry interface: L2AdapterInterface (architecture.rst:29-37)
Async model: 3 eventfds + select.poll (managed directly by the controllers)
Three server entry points: lmcache server (recommended, ZMQ + FastAPI), python -m lmcache.v1.multiprocess.server (legacy ZMQ-only), blend_server_v2 (CacheBlend)
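
The notifier-plus-poll model can be sketched with the standard library alone. This is an illustration under assumptions: a pipe stands in for each eventfd, `PipeNotifier` is a hypothetical class, and nothing here reflects the actual LMCache notifier API. It needs a platform that provides `select.poll` (for example Linux).

```python
import os
import select
import threading


class PipeNotifier:
    """Pipe-based stand-in for an eventfd (assumption, not the LMCache notifier)."""

    def __init__(self) -> None:
        self._read_fd, self._write_fd = os.pipe()

    def fileno(self) -> int:
        return self._read_fd

    def signal(self) -> None:
        os.write(self._write_fd, b"\x01")  # makes the read end readable

    def drain(self) -> None:
        os.read(self._read_fd, 4096)  # consume pending signals

    def close(self) -> None:
        os.close(self._read_fd)
        os.close(self._write_fd)


# One notifier per operation kind, matching the store/lookup/load split.
notifiers = {name: PipeNotifier() for name in ("store", "lookup", "load")}

poller = select.poll()
fd_to_name = {}
for name, notifier in notifiers.items():
    poller.register(notifier.fileno(), select.POLLIN)
    fd_to_name[notifier.fileno()] = name

# A worker thread finishes a "load" and signals only that notifier.
worker = threading.Thread(target=notifiers["load"].signal)
worker.start()

# The controller side blocks in poll() until some notifier fires.
woken = []
for fd, _mask in poller.poll(5000):  # timeout in milliseconds
    name = fd_to_name[fd]
    notifiers[name].drain()
    woken.append(name)

worker.join()
for notifier in notifiers.values():
    notifier.close()
```

Because each operation kind has its own file descriptor, the polling side learns which kind of work completed without inspecting any result queue first.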

4. Which interface to use (★ from the standpoint of the backend we will write)

| | Non-MP | MP |
| --- | --- | --- |
| Abstract class | `StoragePluginInterface` (← `StorageBackendInterface`) | `L2AdapterInterface` |
| Location | `lmcache/v1/storage_backend/abstract_backend.py:424` | `lmcache/v1/distributed/l2_adapters/base.py` |
| Method style | mostly async coroutines (`batched_async_contains`, `batched_get_blocking`, etc.) + prefix-only get/contains | `submit_*` (non-blocking) + `pop_*`/`query_*` (one-shot) + 3 eventfds |
| Async trigger | asyncio loop (provided by vLLM) | eventfd wakes the controller's `select.poll` |
| Registration | `extra_config.storage_plugin.<name>.{module_path,class_name}` (yaml/dict) | `--l2-adapter` JSON (`type`, `module_path`, `class_name`, `adapter_params`) + self-registration via `register_l2_adapter_*` |
| Concurrency | asyncio tasks | two controllers (Store/Prefetch) call the same instance concurrently, so the plugin must be thread-safe |
| Error model | per-key results | Store is coarse (one success/fail bool for the whole batch); Lookup/Load are fine-grained (Bitmap) |
| Lock responsibility | none (single asyncio thread) | the L2-side lock refcount must be implemented by the adapter (lookup-and-lock → load → unlock) |
| Data distribution | prefix-only contains/get | bitmap-based arbitrary patterns |
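
As a rough illustration of the MP registration row, the sketch below builds a JSON value for `--l2-adapter` using only the four keys named in the table. The module path, class name, and the contents of `adapter_params` are hypothetical placeholders, and how the server parses the flag is not confirmed here.

```python
import json

# Keys come from the table above; every value except "plugin" is a
# hypothetical placeholder chosen for illustration.
l2_adapter_spec = {
    "type": "plugin",
    "module_path": "my_pkg.fdp_l2_adapter",   # hypothetical module
    "class_name": "FdpL2Adapter",             # hypothetical class
    "adapter_params": {"device_path": "/dev/example0"},  # hypothetical params
}

# Serialize once and pass it as the flag's value.
l2_adapter_arg = json.dumps(l2_adapter_spec)
cli_fragment = ["--l2-adapter", l2_adapter_arg]

# Round-trip to confirm the value is valid JSON with the expected keys.
parsed = json.loads(cli_fragment[1])
```

Keeping the spec as a dict and serializing it at the last moment avoids hand-escaping quotes in shell scripts.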

Evidence that StoragePluginInterface is in-process only:

docs/source/developer_guide/extending_lmcache/storage_plugins.rst:60-62: "The storage plugin system described above applies to non-MP mode (single-process). For MP mode (multiprocess), LMCache provides the plugin L2 adapter type ..." In other words, the official documentation states explicitly that the two interfaces are split by mode.
lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py:75-83: the class docstring itself says "Legacy raw-block storage plugin wrapper. ... preserves the existing non-MP interface and prefix semantics".
lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py:100-101: if self.loop is None: raise ValueError("RustRawBlockBackend requires an asyncio event loop"), so an asyncio loop is mandatory. The MP server itself runs on eventfd + poll rather than an asyncio loop, so this backend can only be used on top of the loop that vLLM provides.

The standard pattern for supporting the same device/storage in both modes: keep a shared core separately and layer two wrappers on top of it. raw_block follows exactly this pattern (raw_block.md:46-64):

                  RawBlockCore
                  (durable I/O, slot allocation, lock refcount, checkpoint)
                       ▲                          ▲
                       │                          │
        ┌──────────────┘                          └──────────────┐
        │                                                        │
RustRawBlockBackend                                  RawBlockL2Adapter
(StoragePluginInterface)                            (L2AdapterInterface)
   = Non-MP wrapper                                    = MP wrapper
   prefix-only get/contains                            non-blocking + eventfd
   asyncio.run_coroutine_threadsafe                    3 ThreadPoolExecutors
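
The shared-core pattern in the diagram can be sketched in a few lines. This is a toy under stated assumptions: `ToyCore`, `InProcessWrapper`, `MPWrapper`, and all of their method names are hypothetical and do not mirror the real RawBlockCore, RustRawBlockBackend, or RawBlockL2Adapter signatures. For brevity the MP-style `pop_lookup_result` blocks on the future, whereas a real adapter would be notified first.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyCore:
    """Shared storage logic; both wrappers delegate here."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._blocks = {}

    def put(self, key: str, data: bytes) -> None:
        with self._lock:
            self._blocks[key] = data

    def contains(self, key: str) -> bool:
        with self._lock:
            return key in self._blocks


class InProcessWrapper:
    """Non-MP style: async API with prefix-only semantics."""

    def __init__(self, core: ToyCore) -> None:
        self._core = core

    async def contains_prefix(self, keys) -> int:
        hits = 0
        for key in keys:  # stop at the first miss: only a prefix counts
            if not await asyncio.to_thread(self._core.contains, key):
                break
            hits += 1
        return hits


class MPWrapper:
    """MP style: submit returns a task id at once; the result is popped later."""

    def __init__(self, core: ToyCore) -> None:
        self._core = core
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}
        self._next_id = 0

    def _lookup(self, keys):
        return [self._core.contains(k) for k in keys]  # one bit per key

    def submit_lookup(self, keys) -> int:
        task_id = self._next_id
        self._next_id += 1
        self._pending[task_id] = self._pool.submit(self._lookup, list(keys))
        return task_id

    def pop_lookup_result(self, task_id: int):
        return self._pending.pop(task_id).result(timeout=5)

    def close(self) -> None:
        self._pool.shutdown()


core = ToyCore()
core.put("a", b"1")
core.put("c", b"3")
keys = ["a", "b", "c"]
prefix_hits = asyncio.run(InProcessWrapper(core).contains_prefix(keys))
mp = MPWrapper(core)
bitmap = mp.pop_lookup_result(mp.submit_lookup(keys))
mp.close()
```

With the same core contents, the prefix-only wrapper reports a single leading hit while the bitmap-style wrapper also sees the hit after the gap, which is the difference the two wrapper columns describe.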

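Conclusion for us (the FDP plugin):

Build MP only (L2AdapterInterface + the plugin or native_plugin type). Non-MP is legacy, so there is no need to go deep into it.
The shared-core pattern is still worth copying: if Non-MP compatibility is requested later, only a wrapper needs to be added.

5. Summary table of key differences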

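| Item | Non-MP (legacy) | MP (recommended) |
| --- | --- | --- |
| Process | embedded in vLLM | separate server (`lmcache server`) |
| Communication | in-process function calls | ZMQ tcp (DEALER/ROUTER) |
| L1 cache sharing | separate per vLLM pod | shared by all pods on the node |
| GIL contention | yes (same GIL as vLLM) | none (different process) |
| Abstract class | `StoragePluginInterface` | `L2AdapterInterface` |
| Async model | asyncio | eventfd + poll |
| `per_tp_device_paths` | supported (TP rank → device mapping) | rejected (raw_block_l2_adapter.py:147-150) |
| Server entry point | (none; imported inside vLLM) | `lmcache server` / `python -m lmcache.v1.multiprocess.server` / `blend_server_v2` |
| Where our backend plugs in | (optional) a wrapper like `RustRawBlockBackend` | (required) an `L2AdapterInterface` implementation with `type: "plugin"` |
| Going forward | maintenance mode | active development |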

6. What this means when writing the FDP plugin

Target mode = MP only. Non-MP compatibility is a lower priority. Implement only L2AdapterInterface first.
What we need to care about:
- 3 eventfds (store/lookup/load), created with lmcache.v1.platform.create_event_notifier()
- submit_* is non-blocking; results are collected separately via pop_*/query_*
- thread safety (StoreController and PrefetchController call concurrently)
- L2-side lock refcount (lookup_and_lock → load → unlock)
What we do not need:
- an asyncio event loop (if the plugin wants to do async work internally it can create its own; the framework does not provide one, see plugin_pipeline.md)
- vLLM-compatible interfaces (Non-MP's prefix-only get/contains, etc.)
TP > 1 scenario: MP currently rejects per_tp_device_paths, so spreading devices across TP ranks requires a separate MP-side PR first. In Phase 0 we start with a single device.
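
The lock-refcount and thread-safety requirements listed above can be sketched as a small synchronous toy. `RefcountedStore` and its method names are hypothetical; the real L2AdapterInterface splits these steps into submit_*/pop_* calls, and its exact signatures are not reproduced here.

```python
import threading


class RefcountedStore:
    """Toy L2-side store: locked keys cannot be evicted until unlocked."""

    def __init__(self) -> None:
        # One mutex guards both dicts because callers may be on different threads.
        self._mutex = threading.Lock()
        self._data = {}
        self._refcount = {}

    def store(self, items: dict) -> bool:
        """Coarse result: a single bool for the whole batch."""
        with self._mutex:
            self._data.update(items)
        return True

    def lookup_and_lock(self, keys) -> list:
        """Fine-grained result: one bit per key; each hit takes a reference."""
        with self._mutex:
            bitmap = [k in self._data for k in keys]
            for key, hit in zip(keys, bitmap):
                if hit:
                    self._refcount[key] = self._refcount.get(key, 0) + 1
            return bitmap

    def load(self, keys, bitmap) -> list:
        with self._mutex:
            return [self._data[k] if hit else None for k, hit in zip(keys, bitmap)]

    def unlock(self, keys, bitmap) -> None:
        with self._mutex:
            for key, hit in zip(keys, bitmap):
                if hit:
                    self._refcount[key] -= 1
                    if self._refcount[key] == 0:
                        del self._refcount[key]

    def evict(self, key) -> bool:
        """Refuse eviction while any reference is outstanding."""
        with self._mutex:
            if key not in self._data or self._refcount.get(key, 0) > 0:
                return False
            del self._data[key]
            return True


s = RefcountedStore()
stored = s.store({"k0": b"a", "k2": b"c"})
keys = ["k0", "k1", "k2"]
bitmap = s.lookup_and_lock(keys)
evict_while_locked = s.evict("k0")
loaded = s.load(keys, bitmap)
s.unlock(keys, bitmap)
evict_after_unlock = s.evict("k0")
```

The refcount exists so that an eviction racing with a prefetch cannot remove a block between the lookup and the load; releasing in `unlock` makes the block evictable again.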

References

Detailed MP internals (ZMQ RequestType enum, MPCacheEngine, StorageManager division of responsibilities): docs/source/mp/architecture.rst
L2 adapter contract in detail: [[l2_adapters_contract]]
Non-MP plugin contract in detail: docs/source/developer_guide/extending_lmcache/storage_plugins.rst
Example of the shared-core pattern (raw_block): [[raw_block_line]]
