LLM serving
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, deploying these models in real-world applications is challenging because of their high computational requirements and the need for efficient serving infrastructure.
In this note, we summarize two popular inference engines for LLMs: vLLM and SGLang.
Milestone models
xxx
Main challenges
xxx
Popular optimizations
Inference engines
vLLM
vLLM is a high-performance inference engine for LLMs. It is optimized for throughput and combines several serving optimizations, including PagedAttention, continuous batching, and speculative decoding.
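To make the throughput benefit of continuous batching concrete, here is a minimal toy sketch. It is not vLLM's scheduler: the function names and the simplified cost model (one decode step yields one token per running request, with prefill cost and memory limits ignored) are assumptions for illustration only. It counts decode steps under static batching, where a batch runs until its longest request finishes, and under continuous batching, where a freed slot is refilled before the next step.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        # Short requests in the batch sit idle until the longest one is done.
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(output_lens)  # tokens still to generate, per queued request
    running = []                  # tokens still to generate, per in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical output lengths (in tokens) for six requests.
    output_lens = [2, 8, 3, 8, 1, 2]
    print("static:    ", static_batching_steps(output_lens, batch_size=2))
    print("continuous:", continuous_batching_steps(output_lens, batch_size=2))
```

With these example lengths and a batch size of 2, the static schedule needs 18 decode steps and the continuous schedule needs 13, because short requests no longer hold a slot idle while a long request in the same batch finishes.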
SGLang
SGLang is a throughput-optimized inference engine for LLMs. On long-context tasks in particular, SGLang is reported to achieve up to 5x higher throughput than vLLM.
Best practices
xxx