LLM serving

Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, deploying these models in real-world applications can be challenging because of their high computational requirements and the need for efficient serving infrastructure.

In this note, we summarize two popular inference engines for LLMs: vLLM and SGLang.

Milestone models

xxx

Main challenges

xxx

Inference engine

vLLM

vLLM is a high-performance inference engine for LLMs. It is optimized for throughput and combines several techniques, including PagedAttention (block-based management of the KV cache), continuous batching, and speculative decoding.
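To make the idea behind PagedAttention more concrete, here is a minimal sketch of block-based KV-cache bookkeeping. It is a toy illustration only and not vLLM's implementation or API: the names `PagedKVCache`, `append_token`, `free`, and the parameters `num_blocks` and `block_size` are hypothetical, and no tensors or attention computation are involved. The point is that each sequence gets a block table mapping its tokens to fixed-size blocks, so memory is reserved one block at a time instead of for the maximum sequence length up front.

```python
class PagedKVCache:
    """Toy allocator: tracks which fixed-size blocks each sequence owns."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Physical block ids that are currently unused.
        self.free_blocks = list(range(num_blocks))
        # Per-sequence block table: sequence id -> list of physical block ids.
        self.block_tables: dict[str, list[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of `seq_id`."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when all owned blocks are full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")   # 3 tokens occupy 2 blocks
cache.append_token("b")       # 1 token occupies 1 block
print(cache.block_tables)     # {'a': [0, 1], 'b': [2]}
cache.free("a")
print(cache.free_blocks)      # [3, 0, 1]
```

In this sketch a sequence wastes at most `block_size - 1` token slots, and blocks released by a finished sequence can be reused immediately by other requests, which is what lets a scheduler keep more requests in a batch.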

SGLang

SGLang is a throughput-optimized inference engine for LLMs. On long-context tasks in particular, it can achieve up to a 5x throughput improvement over vLLM.

Best practices

xxx