vMLX is an open-source inference engine for Mac built on Apple's MLX framework. It ships with prefix caching, a paged KV cache, continuous batching, and MCP tool integration.
Built from the ground up for Apple Silicon, with advanced inference capabilities that competing engines simply don't offer.
Reuse the KV state computed during prefill across conversations. Dramatically reduce time-to-first-token when system prompts and shared context repeat.
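A minimal sketch of the idea, not vMLX's actual implementation: key the cache on a hash of the shared token prefix, so the expensive prefill pass runs once per prefix instead of once per request.

```python
# Prefix caching sketch. Every name here is illustrative, not vMLX's API.
import hashlib

kv_cache: dict[str, dict] = {}  # prefix hash -> cached KV state

def prefill(tokens: list[int]) -> dict:
    """Stand-in for the expensive prefill pass over `tokens`."""
    return {"tokens_processed": len(tokens)}  # placeholder for real KV tensors

def generate(tokens: list[int], prefix_len: int) -> tuple[dict, dict]:
    """Prefill only what isn't cached: the shared prefix is computed once."""
    key = hashlib.sha256(str(tokens[:prefix_len]).encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = prefill(tokens[:prefix_len])  # cold path: full prefill
    prefix_state = kv_cache[key]                      # warm path: reused as-is
    suffix_state = prefill(tokens[prefix_len:])       # only the new tokens
    return prefix_state, suffix_state
```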
Memory-efficient key-value caching with configurable block sizes. Handle longer contexts without running out of unified memory on your Mac.
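Conceptually, a paged KV cache stores keys and values in fixed-size blocks and maps each sequence's logical positions to physical blocks, so memory is allocated on demand instead of reserved for the full context up front. The sketch below is an illustration under assumed names, not vMLX's internals.

```python
# Paged KV cache sketch: fixed-size blocks plus a per-sequence block table.
# Block size and structure are assumptions for illustration only.
BLOCK_SIZE = 256  # tokens per block (configurable in a real engine)

free_blocks = list(range(1024))           # pool of physical block ids
block_tables: dict[int, list[int]] = {}   # sequence id -> its physical blocks

def append_token(seq_id: int, position: int) -> tuple[int, int]:
    """Map a sequential token position to (physical_block, offset),
    allocating a new block only when the previous one fills up."""
    table = block_tables.setdefault(seq_id, [])
    block_idx, offset = divmod(position, BLOCK_SIZE)
    if block_idx == len(table):           # crossed into a new block
        table.append(free_blocks.pop())   # allocate one block, not a whole context
    return table[block_idx], offset

def release(seq_id: int) -> None:
    """Return a finished sequence's blocks to the pool."""
    free_blocks.extend(block_tables.pop(seq_id, []))
```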
Serve multiple concurrent requests with intelligent batch scheduling. Maximize throughput with up to 256 concurrent sequences.
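The scheduling idea, sketched under assumed names: finished sequences leave the batch and waiting requests join it on every decode step, rather than the whole batch draining before new work starts. Only the 256-sequence ceiling comes from the text above.

```python
# Continuous batching sketch; illustrative only, not vMLX's scheduler.
from collections import deque

MAX_SEQS = 256
waiting: deque = deque()   # incoming requests
running: list = []         # sequences currently decoding

def step(decode_one, is_done) -> None:
    """One scheduler iteration: admit, decode, retire."""
    while waiting and len(running) < MAX_SEQS:
        running.append(waiting.popleft())    # admit new requests mid-flight
    for seq in running:
        decode_one(seq)                      # one token for every live sequence
    running[:] = [s for s in running if not is_done(s)]  # retire finished ones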
Native Model Context Protocol (MCP) support. Connect your models to external tools, APIs, and data sources for agentic workflows.
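For example, a minimal tool server built with the official MCP Python SDK. The `get_forecast` tool is a made-up example, and how vMLX is pointed at the server depends on its own configuration, which isn't shown here.

```python
# A tiny MCP tool server using the official Python SDK (pip install "mcp").
# The tool is a toy; the vMLX-side wiring is not part of this sketch.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def get_forecast(city: str) -> str:
    """Return a short (fake) forecast for a city."""
    return f"Sunny and 22°C in {city}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```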
Drop-in replacement for OpenAI's chat completions endpoint. Point your existing tools, scripts, and integrations at a local base URL and they keep working unchanged.
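Because the endpoint mirrors OpenAI's chat completions API, the official `openai` Python client works once its base URL points at the local server. The port, path, and model name below are placeholders; use whatever your vMLX instance reports.

```python
# Using the official OpenAI Python client against a local vMLX server.
# Port, path, and model id are placeholders, not documented defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your vMLX server's address
    api_key="not-needed",                 # no cloud key required locally
)

resp = client.chat.completions.create(
    model="mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # example model id
    messages=[{"role": "user", "content": "Hello from my Mac!"}],
)
print(resp.choices[0].message.content)
```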
Save, restore, and manage conversation sessions. Switch between models and contexts without losing your chat history.
Optimized for Apple Silicon's unified memory architecture. Every token, every batch, every cache hit — tuned for maximum throughput.
Download the latest release, pick a model from Hugging Face, and start serving. vMLX handles model management, server lifecycle, and API exposure automatically.
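If you'd rather fetch a model manually (vMLX's own model management may make this unnecessary), the huggingface_hub library handles the download; the repo id below is one example of an MLX-ready community model.

```python
# Manually downloading an MLX-ready model from Hugging Face.
# The repo id is an example; pick any model your hardware can hold.
from huggingface_hub import snapshot_download

path = snapshot_download("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(f"Model files are in {path}")
```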
Tune every aspect of the inference pipeline, from prefill batch sizes to cache memory allocation. vMLX exposes the knobs that other engines hide or simply don't have.
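As a sketch of the kind of tuning surface this implies: the key names below are hypothetical stand-ins, not vMLX's actual option names, so consult the project docs for the real ones.

```python
# Hypothetical tuning knobs; every key name here is illustrative only.
engine_config = {
    "prefill_batch_size": 8,          # prompts processed per prefill pass
    "kv_block_size": 256,             # tokens per paged-cache block
    "kv_cache_memory_fraction": 0.6,  # share of unified memory for the cache
    "max_concurrent_sequences": 256,  # continuous-batching ceiling
    "prefix_cache": True,             # reuse shared prompt prefixes
}
```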
vMLX is free, open source, and built for the Mac you already have.
No cloud. No API keys. No rate limits. Just inference.