vLLM Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention
Paper link: Efficient Memory Management for Large Language Model Serving with PagedAttention (arxiv.org)

Ⅰ Summary (Introduction)
Background: increasing throughput and me..