
vLLM

LLMs

High-throughput, memory-efficient LLM serving engine.

45k stars · 7k forks · Python

About

vLLM is a fast inference and serving engine for large language models, built around PagedAttention, which stores the KV cache in paged, non-contiguous memory blocks to reduce fragmentation and increase throughput. It is widely used for production-scale deployments, offering an OpenAI-compatible API server and multi-GPU support.
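As a minimal sketch of the OpenAI-compatible API, the snippet below queries a locally running vLLM server with the official `openai` Python client. The model name and the `localhost:8000` address are illustrative assumptions (8000 is vLLM's default server port).

```python
from openai import OpenAI

# Point the official OpenAI client at a local vLLM server, started e.g. with:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# The base_url and model name below are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```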

Key Features

  • PagedAttention for efficient, paged KV-cache memory management
  • High-throughput inference with continuous batching
  • OpenAI-compatible API server
  • Multi-GPU serving via tensor parallelism (see the sketch after this list)
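
For offline (non-server) use, here is a minimal sketch with vLLM's Python API; the model is a small placeholder and `tensor_parallel_size=2` assumes two GPUs are available.

```python
from vllm import LLM, SamplingParams

# Placeholder model for illustration; tensor_parallel_size=2 shards the
# model across two GPUs (drop it or set it to 1 on a single-GPU machine).
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```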

Tags

Inference · Serving · Performance · Production
