Published: Apr 03, 2024
Assess: Worth exploring with the goal of understanding how it will affect your enterprise.

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs) that's particularly effective thanks to its implementation of continuous batching for incoming requests. It supports several deployment options, including distributed tensor-parallel inference and serving with the Ray runtime, deployment in the cloud with SkyPilot and deployment with NVIDIA Triton, Docker and LangChain. Our teams have had good experience running Dockerized vLLM workers in an on-prem virtual machine, integrating with an OpenAI-compatible API server which, in turn, is leveraged by a range of applications, including IDE plugins for coding assistance and chatbots. Our teams use vLLM to run models such as CodeLlama 70B, CodeLlama 7B and Mixtral. Also notable is the engine's scaling capability: it takes only a couple of configuration changes to go from running a 7B model to a 70B one. If you're looking to productionize LLMs, vLLM is worth exploring.
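As a rough illustration of the OpenAI-compatible serving path described above, the sketch below assumes a vLLM server is already running locally on port 8000 and that a CodeLlama-style chat model is loaded; the endpoint and model name are placeholders for your own deployment, and the client code is the standard openai Python client rather than anything vLLM-specific.

```python
# Minimal sketch: querying a locally running vLLM OpenAI-compatible server.
# Assumes the server was started separately (for example via vLLM's
# OpenAI-compatible API server entrypoint or the vllm/vllm-openai Docker image).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                 # vLLM ignores the key unless auth is configured
)

response = client.chat.completions.create(
    model="codellama/CodeLlama-7b-Instruct-hf",  # placeholder: whichever model the server serves
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    max_tokens=256,
    temperature=0.2,
)

print(response.choices[0].message.content)
```

For the scaling point, the configuration change typically amounts to swapping the model name and raising the tensor-parallelism degree (for instance, vLLM's tensor-parallel-size setting) when launching the server; the client code above stays the same.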
