vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs) that's particularly effective thanks to its implementation of continuous batching for incoming requests. It supports several deployment options, including distributed tensor-parallel inference and serving with the Ray runtime, cloud deployment with SkyPilot and deployment with NVIDIA Triton, Docker and LangChain. Our teams have had good experience running dockerized vLLM workers in an on-prem virtual machine, integrating with its OpenAI-compatible API server, which, in turn, is leveraged by a range of applications, including IDE plugins for coding assistance and chatbots. Our teams use vLLM to run models such as CodeLlama 70B, CodeLlama 7B and Mixtral. Also notable is the engine's scaling capability: it only takes a couple of config changes to go from running a 7B to a 70B model. If you're looking to productionize LLMs, vLLM is worth exploring.
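
To illustrate that scaling point, here's a minimal sketch using vLLM's offline Python API; the model name and tensor_parallel_size value are placeholders for whatever fits your hardware, and moving between a 7B and a 70B model is essentially a change to these two settings.

```python
from vllm import LLM, SamplingParams

# Model and parallelism are illustrative; swap in the checkpoint and GPU count
# that match your deployment (e.g., a 70B model with a higher tensor_parallel_size).
llm = LLM(
    model="codellama/CodeLlama-7b-Instruct-hf",
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)

# vLLM continuously batches incoming prompts under the hood.
outputs = llm.generate(
    ["Write a Python function that reverses a string."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)
```

The same model can instead be exposed through vLLM's OpenAI-compatible API server (for example via the vllm/vllm-openai Docker image), so existing OpenAI client libraries and IDE plugins can point at it without code changes.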